CIRSS (Center for Informatics Research in Science and Scholarship) Seminar talk given on Sept. 19, 2014 at GSLIS, UIUC.
http://cirssweb.lis.illinois.edu/Events/eventDetails.php?id=214
Euler: A Logic‐Based Toolkit for Aligning & Reconciling Multiple Taxonomic Perspectives
1. Euler: A Logic-Based Toolkit for Aligning and Reconciling Multiple Taxonomic Perspectives
Mingmin Chen¹, Shizhuo Yu¹, Parisa Kianmajd¹, Nico Franz², Shawn Bowers³, Bertram Ludäscher⁴
¹ Dept. of Computer Science, University of California, Davis
² School of Life Sciences, Arizona State University
³ Dept. of Computer Science, Gonzaga University
⁴ GSLIS & NCSA, University of Illinois at Urbana-Champaign
2. Outline
• Meet Nico, Curator of Insects
• TAP: The Taxonomy Alignment Problem
• Euler/X
– Logic Inside! (X in FOL, RCC, ASP)
• Related Projects
B. Ludäscher, Euler: Reasoning about Taxonomies, CIRSS Seminar, 9/19/2014
3. Meet Prof. Nico Franz: Curator of Insects @ ASU
4. What Nico et al. do for a living …
5. Use Case: Perelleschus sec. 2001 & 2006
Perelleschus salpinflexus sec. Franz & Cardona-Duque (2013), DOI:10.1080/14772000.2013.806371
Input articulations: Franz & Cardona-Duque. 2013. Description of two new species and phylogenetic reassessment of Perelleschus Wibmer & O'Brien, 1986 (Coleoptera: Curculionidae), with a complete taxonomic concept history of Perelleschus sec. Franz & Cardona-Duque, 2013. Systematics and Biodiversity 11: 209–236.
Merge analyses: Franz et al. 2014. Reasoning over taxonomic change: exploring alignments for the Perelleschus use case. PLoS ONE. (in press)
6. Goal: Align two phylogenies with differential taxon sampling
T1: Perelleschus sec. 2001
• Phylogenetic revision
• 8 ingroup species concepts
• 2 outgroup concepts
• 18 concepts total
T2: Perelleschus sec. 2006
• Exemplar analysis
• 2 ingroup species concepts
• 1 outgroup concept
• 7 concepts total
Source: Nico Franz. Explaining taxonomy's legacy to computers – how and why? The Meaning of Names: Naming Diversity in the 21st Century, Museum of Natural History, U of Colorado, 9/30/2014.
7. What Nico does for a living (cont’d): The Indoors Part
• Go fun places, find new bugs, study them …
– “Bugs-R-Us” (see taxonbytes.org)
• Now: Compare, align and revise taxonomies, based on careful observation, “character” data, expertise …
• Formally:
– Input: T1 + T2 (taxonomies) + A (expert articulations)
– Output: revised, “merged” taxonomy (-ies) T3
8. Taxonomy Alignment Problem (TAP)
• Given:
– Taxonomies T1, T2
• incl. constraints (coverage, disjointness)
– Set of articulations (an alignment) A
• Find:
– Combined (“merged”) taxonomy T3 (= T1 + T2 + A)
• Is it a taxonomy? Or a DAG?
– Optional:
• Final alignment (should be minimal)
[Figure labels: T1, T2, A, T3]
9. Real Example: Turn this …
[Figure: input alignment graph linking taxonomy 1 concepts (1.11–1.27, 1.12L) to taxonomy 2 concepts (2.35–2.54, 2.36L); articulation labels include ==, !, "< OR ==", and "> OR !"]
Nodes: taxonomy 1: 18, taxonomy 2: 21
Edges: isa_1: 17, isa_2: 20, articulations: 20
11. So how does it work?
• If you have 3 concepts A, B, and C.
• Assume you know something about
– A ⇔R1 B (e.g., R1: A is a subset of B)
– B ⇔R2 C (e.g., R2: B is disjoint from C)
• Now what can you say about this:
– A ⇔R3 C
• Yes ??
• … it follows that R3: A is disjoint from C!
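The inference on this slide can be sanity-checked by brute force over small finite sets. The sketch below is only an illustration, not how Euler/X reasons: the function names and the four-element universe are assumptions made for the example. It enumerates every choice of non-empty A, B, C with A a subset of B and B disjoint from C, and confirms that A and C never share an element.

```python
from itertools import combinations


def nonempty_subsets(universe):
    """All non-empty subsets of a small universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def r3_always_disjoint(universe):
    """Check every A, B, C with A ⊆ B (R1) and B ∩ C = ∅ (R2).

    Returns (holds, cases): whether A ∩ C = ∅ held in all cases,
    and how many (A, B, C) combinations satisfied R1 and R2.
    """
    subsets = nonempty_subsets(universe)
    cases = 0
    for a in subsets:
        for b in subsets:
            if not a <= b:          # R1: A is a subset of B
                continue
            for c in subsets:
                if b & c:           # R2: B is disjoint from C
                    continue
                cases += 1
                if a & c:           # would be a counterexample to R3
                    return False, cases
    return True, cases


holds, cases = r3_always_disjoint({1, 2, 3, 4})
print(holds, cases)
```

A finite check like this is not a proof, but it shows the shape of the argument: R3 is forced by R1 and R2, with no extra expert input.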
12. Articulation Language (RCC-5)
• How does the expert express the known (or assumed) relationship between taxa A and B?
• How can A and B be related?
• Use basic set constraints (B5):
– A = B (equals, EQ) (==)
– A < B (proper part of, PP) (<)
– A > B (inverse proper part of, IPP) (>)
– A o B (partially overlaps, PO) (><)
– A ! B (disjoint “region”, DR) (!)
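As a minimal sketch of the five base relations, the function below classifies two non-empty finite sets into exactly one RCC-5 relation, using the symbols listed on this slide. Treating taxa as explicit sets of members and the name `rcc5` are assumptions made for illustration; Euler itself reasons over constraints, not enumerated members.

```python
def rcc5(a, b):
    """Return the single RCC-5 base relation holding between sets a and b."""
    a, b = frozenset(a), frozenset(b)
    if not a or not b:
        raise ValueError("taxa are assumed to be non-empty")
    if a == b:
        return "=="   # EQ: equals
    if a < b:
        return "<"    # PP: proper part of
    if a > b:
        return ">"    # IPP: inverse proper part of
    if a & b:
        return "><"   # PO: partially overlaps
    return "!"        # DR: disjoint


print(rcc5({1, 2}, {1, 2}))   # ==
print(rcc5({1}, {1, 2}))      # <
print(rcc5({1, 2}, {1}))      # >
print(rcc5({1, 2}, {2, 3}))   # ><
print(rcc5({1}, {2}))         # !
```

Because the branches are tested in order and cover every case, one and only one relation is returned for any pair of non-empty sets.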
13. Taxonomies and Articulations in Euler
A taxonomy T is a triple (N, ≼, ϕ) with names (taxa) N, a partial order (is-a) ≼, and taxonomic constraints ϕ.
• Sibling Disjointness: sibling taxa do not overlap
• (Parent) Coverage: The union of the children “covers” the parent → no “missing” children
An articulation is a relation (set-constraint) between taxa A and B. One, and only one, of the following base relations B5 must hold:
(i) congruence
(ii) proper part
(iii) inverse proper part
(iv) partial overlap
(v) disjointness
There are 32 (= 2^5) possible disjunctions for representing partial information.
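The two taxonomic constraints can be made concrete with a small sketch. The `Taxonomy` class, the `violations` helper, and the member sets below are hypothetical: they assume each taxon comes with an explicit set of members, which is an illustration device and not Euler's input format.

```python
from dataclasses import dataclass


@dataclass
class Taxonomy:
    names: set     # N: the taxa
    parent: dict   # is-a edges, child -> parent

    def children(self, taxon):
        return sorted(c for c, p in self.parent.items() if p == taxon)


def violations(tax, ext):
    """List violated constraints, given ext: taxon -> set of members."""
    problems = []
    for taxon in sorted(tax.names):
        kids = tax.children(taxon)
        if not kids:
            continue
        # Sibling disjointness: no two children share a member.
        for i, x in enumerate(kids):
            for y in kids[i + 1:]:
                if ext[x] & ext[y]:
                    problems.append(("sibling disjointness", x, y))
        # Coverage: the union of the children equals the parent.
        covered = set().union(*(ext[k] for k in kids))
        if covered != ext[taxon]:
            problems.append(("coverage", taxon))
    return problems


t = Taxonomy(names={"a", "b", "c"}, parent={"b": "a", "c": "a"})
good = {"a": {1, 2, 3}, "b": {1}, "c": {2, 3}}
bad = {"a": {1, 2, 3}, "b": {1, 2}, "c": {2}}
print(violations(t, good))   # []
print(violations(t, bad))
```

In the second example the siblings b and c share member 2, and member 3 of the parent is in neither child, so both constraints are reported.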
14. R32 lattice of 32 (= 2^5) disjunctions over B5
Level 5 (tautology): = < > o ! (TRUE)
Level 4: = < o ! | = < > o | = > o ! | = < > ! | < > o !
Level 3: = < o | = o ! | = < ! | < o ! | = > o | = < > | = > ! | < > o | > o ! | < > !
Level 2: = o | = < | = ! | < o | o ! | < ! | = > | > o | < > | > !
Level 1 (BASE-5 relations): = | o | < | ! | >
Level 0 (contradiction): ∅ (FALSE)
Legend:
= EQ(x,y) Equals
< PP(x,y) Proper Part of
> iPP(x,y) Inverse Proper Part
o PO(x,y) Partially Overlaps
! DR(x,y) Disjoint from
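The level structure of the lattice is just the subsets of B5 grouped by size. A short sketch, with names chosen here for illustration, enumerates them and shows that combining two disjunctive articulations about the same pair of taxa amounts to set intersection:

```python
from itertools import combinations

B5 = ["=", "<", ">", "o", "!"]


def r32_levels():
    """Map each level k to the disjunctions made of exactly k base relations."""
    return {k: [frozenset(c) for c in combinations(B5, k)]
            for k in range(len(B5) + 1)}


def combine(r1, r2):
    """Conjoin two disjunctions about the same pair of taxa."""
    return frozenset(r1) & frozenset(r2)


levels = r32_levels()
sizes = [len(levels[k]) for k in range(6)]
print(sizes, sum(sizes))   # [1, 5, 10, 10, 5, 1] 32

# "< OR ==" combined with "> OR ==" leaves only "=".
print(sorted(combine({"<", "="}, {">", "="})))   # ['=']
# Two incompatible base relations meet at level 0 (contradiction).
print(combine({"<"}, {"!"}) == frozenset())      # True
```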
15. Some History
• … Aristotle …
• … Euler …
• …
• … Greg Whitbread …
• [BPB93] J. H. Beach, S. Pramanik, and J. H. Beaman. Hierarchic taxonomic databases. Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision, 1993.
• [Ber95] Walter G. Berendsohn. The concept of “potential taxa” in databases. Taxon, 44:207–212, 1995.
• [Ber03] Walter G. Berendsohn. MoReTax – Handling Factual Information Linked to Taxonomic Concepts in Biology. No. 39 in Schriftenreihe für Vegetationskunde. Bundesamt für Naturschutz, 2003.
• [GG03] M. Geoffroy and A. Güntsch. Assembling and navigating the potential taxon graph. In [Ber03], pages 71–82, 2003.
• [TL07] Thau, D., & Ludäscher, B. (2007). Reasoning about taxonomies in first-order logic. Ecological Informatics, 2(3), 195-209.
• [FP09] Franz, N. M., & Peet, R. K. (2009). Perspectives: towards a language for mapping relationships among taxonomic concepts. Systematics and Biodiversity, 7(1), 5-20.
• …
16. What’s in a name? Euler Diagrams
• Project named after Euler Diagrams:
IF A is-a B
AND C and B are disjoint
------------------------------------
THEN: A and C are disjoint!
17. Euler Diagrams as Trees (or Graphs)
A containment hierarchy (taxonomy)
An equivalent graph (w/ transitive edges): same information
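One way to see that the tree and the graph with transitive edges carry the same information is to compute the transitive closure of the is-a edges. The sketch below uses made-up taxon names and a deliberately simple fixpoint loop; it is an illustration, not Euler code.

```python
def transitive_closure(edges):
    """Add every implied (descendant, ancestor) pair to a set of is-a edges."""
    closure = set(edges)
    while True:
        implied = {(a, d)
                   for (a, b) in closure
                   for (c, d) in closure
                   if b == c}
        if implied <= closure:      # nothing new: fixpoint reached
            return closure
        closure |= implied


# Hypothetical containment hierarchy: species isa genus isa family.
tree = {("species", "genus"), ("genus", "family")}
print(sorted(transitive_closure(tree)))
```

The closure adds only the edge from "species" to "family", which was already implied by the two tree edges.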
18. Represent Phylogenies as Trees …
T1: Perelleschus sec. 2001
• Phylogenetic revision
• 8 ingroup species concepts
• 2 outgroup concepts
• 18 concepts total
[Figure: tree over the T1 concepts 1.11–1.27 and 1.12L]
30. … and PW2 are the only solutions!
What happened?
[Figure: possible-world graph over concepts 1.11–1.27 and 2.35–2.54; node labels include paired concepts such as 1.17 2.41, 1.22 2.46, 1.25 2.48, and 1.12L 2.36L]
31. TAP: Possible Outcomes
[Figure: Input Alignment. Taxonomy 1 has 1.b and 1.c under 1.a (isa); taxonomy 2 has 2.e and 2.f under 2.d (isa); articulation edge labels: =, <, <, <]
36. Euler/X Toolkit and Workflow
• FO reasoning about taxonomies (MFOL)
• Earlier: CleanTax
– Prover9/Mace4
• Now: Euler
– ASP Reasoners (DLV, Clingo)
– Specialized reasoners (PyRCC)
– …
– X = ASP, RCC, …
37. Reducing Ambiguity
• Possible Worlds (PWs) View
• Aggregate View (AV)
• Cluster View (CV)
Explore!
38. Common Outcome: Inconsistency!
[Figure: Input Alignment. Taxonomy 1 has 1.b and 1.c under 1.a (isa); taxonomy 2 has 2.e and 2.f under 2.d (isa); articulation edge labels: =, <, <, <]
Black-Box Provenance (lattice of articulation subsets):
{A1, A2, A3, A4}
{A1, A2, A3} {A1, A2, A4} {A1, A3, A4} {A2, A3, A4}
{A1, A2} {A1, A3} {A2, A3} {A1, A4} {A2, A4} {A3, A4}
{A1} {A2} {A3} {A4}
{ }
Inconsistent! → Diagnosis (Reiter)
• Need to debug the input articulations → (black-box) diagnosis!
• Focus:
– How do we efficiently compute the diagnostic lattice?
• Also:
– How to visualize..
39. A Hybrid Diagnosis Approach Combining Black-Box and White-Box Reasoning
Mingmin Chen¹, Shizhuo Yu¹, Nico Franz², Shawn Bowers³, Bertram Ludäscher⁴
¹ Department of Computer Science, University of California, Davis
² School of Life Sciences, Arizona State University
³ Department of Computer Science, Gonzaga University
⁴ GSLIS & NCSA, University of Illinois at Urbana-Champaign
40. Example Instance (from synthetic benchmark suite)
• Here: N = 10 taxa in T1, T2
• Euler/X finds: inconsistent!
• → diagnostic lattice of 2^10 = 1024 nodes
→ Find minimal inconsistent subset (MIS)
→ maximal consistent subset (MCS) ..
→ show to user!
41. Visualizing Diagnoses
N = 10 articulations → 2^10 = 1024 node diagnostic lattice
42. Better Idea: Just show MIS, MCS
N = 4 articulations → 2^4 = 16 node diagnostic lattice, but 3 MCS and 2 MIS are enough!
43. Visualizing Diagnoses
1024 node lattice .. but 4 MCS and 1 MIC tell it all!
44. Visualizing Diagnoses
Example from RuleML’14 paper: N=12 → 4096 nodes .. but 7 MCS and 5 MIC tell it all!
45. Black-Box Inconsistency Analysis (Diagnostic Lattice)
• Then: What happens if you can’t have all (here: 4) articulations together?
– Repair: find & revise minimal inconsistent subsets (Min-Incons)
– Expand: find maximal consistent subsets (Max-Cons) & revise outs
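A naive way to obtain the Min-Incons and Max-Cons sets is to test every subset of articulations with a black-box consistency check. The sketch below does exactly that; it is the exhaustive baseline, not the hitting-set algorithm used in the talk. The `toy_oracle` is a hypothetical stand-in for a reasoner call and simply declares any subset containing A1, A2, and A3 inconsistent.

```python
from itertools import combinations


def diagnose(articulations, is_consistent):
    """Brute-force the diagnostic lattice; return (MIS, MCS) as lists of frozensets."""
    subsets = [frozenset(c)
               for r in range(len(articulations) + 1)
               for c in combinations(articulations, r)]
    consistent = {s for s in subsets if is_consistent(s)}
    inconsistent = [s for s in subsets if s not in consistent]
    # Minimal inconsistent: no proper subset is also inconsistent.
    mis = [s for s in inconsistent if not any(t < s for t in inconsistent)]
    # Maximal consistent: no proper superset is also consistent.
    mcs = [s for s in consistent if not any(s < t for t in consistent)]
    return mis, mcs


def toy_oracle(subset):
    """Hypothetical black box: A1, A2, A3 together are contradictory."""
    return not {"A1", "A2", "A3"} <= subset


mis, mcs = diagnose(["A1", "A2", "A3", "A4"], toy_oracle)
print([sorted(s) for s in mis])
print(sorted(sorted(s) for s in mcs))
```

With this oracle there is one minimal inconsistent subset, {A1, A2, A3}, and three maximal consistent subsets, each obtained by dropping one of A1, A2, A3. The cost is one oracle call per lattice node, 2^n in total, which is why the exhaustive approach does not scale.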
46. Inconsistency Analysis (Diagnostic Lattice)
• Black-box Analysis (Hitting Set algo.) yields a Diagnosis (lattice)
– for n=4 articulations, there are 168 possible diagnoses
– depending on expected “red/green areas” → explore space differently
• The Min-Incons (MIS) and Max-Cons (MCS) sets determine all others
→ Repair MIS and/or Expand MCS
• |articulations| = n → |possible diagnoses| = |monotonic Boolean functions| = Dedekind Number (n): 2, 3, 6, 20, 168, 7581, 7828354, ...
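The Dedekind numbers quoted on this slide can be reproduced for small n by counting monotone Boolean functions directly. The sketch below relies on a standard decomposition: a monotone function on n variables is a pair (f0, f1) of monotone functions on n-1 variables with f0 ≤ f1 pointwise, where f0 and f1 fix the last variable to 0 and 1. The function names are chosen here for illustration.

```python
def monotone_functions(n):
    """All monotone Boolean functions on n variables, as truth-table tuples."""
    if n == 0:
        return [(0,), (1,)]        # the two constants
    smaller = monotone_functions(n - 1)
    result = []
    for f0 in smaller:             # restriction with the last variable = 0
        for f1 in smaller:         # restriction with the last variable = 1
            if all(a <= b for a, b in zip(f0, f1)):
                result.append(f0 + f1)
    return result


def dedekind(n):
    return len(monotone_functions(n))


print([dedekind(n) for n in range(5)])   # [2, 3, 6, 20, 168]
```

The value for n = 4 matches the 168 possible diagnoses stated for four articulations. Enumeration stays cheap up to n = 5 (7581) but becomes impractical soon after, since n = 6 already has 7828354 functions.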
47. Improving Diagnosis
• Reiter’s “black-box” (model-based) diagnosis helps debug the articulations
• Limited scalability (inherent complexity)
• But every bit helps:
– Hitting Set Algorithm (“logarithmic extraction”)
• Our idea:
– Exploit “white-box” reasoning information → RULES to the rescue
48. Key Idea: exploit white-box info
• We use Answer Set Programming (ASP) to solve Taxonomy Alignment Problem (TAP)
• Inconsistency = “False” is derived in the head:
False :- <denial of integrity constraint>
• Apply provenance trick from databases :-)
– What articulations contribute to a derivation of “False”?
– Eliminate those that don’t!
→ an example of reusing inferences across separate black-box tests!
49. The Provenance “Trick”
50. Hybrid Provenance
Black-box Provenance
A3: c < f
[Figure: Input Alignment. Taxonomy 1 has 1.b and 1.c under 1.a (isa); taxonomy 2 has 2.e and 2.f under 2.d (isa); articulation edge labels: =, <, <, <]
51. Hybrid Provenance
Black-box Provenance
A3: c < f
White-box Provenance
r7: d = e ∪ f, A1: a = d ⇒ a = e ∪ f
r3: a = b ∪ c, r4: b ∩ c = ∅, r8: e ∩ f = ∅, A2: b < e, a = e ∪ f ⇒ f < c
A1 + A2 + … => f < c
[Figure: Input Alignment with 1.b and 1.c under 1.a (isa) and the taxonomy 2 concepts under 2.d (isa); articulation edge labels: =, <]
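The white-box idea can be sketched as forward chaining that carries, with every derived fact, the set of articulation labels used to derive it. The code below is an illustration under stated assumptions, not Euler's ASP encoding: the three rules are hand-written propositional stand-ins for the derivation shown on this slide, and because the slides do not spell out the fourth articulation, A4 appears only as a placeholder atom that no rule uses.

```python
def contributing_labels(facts, rules, goal="false"):
    """Collect the articulation labels that feed some derivation of `goal`.

    facts: atom -> set of labels (empty set for taxonomy constraints)
    rules: list of (head, [body atoms])
    """
    prov = {atom: set(labels) for atom, labels in facts.items()}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if all(atom in prov for atom in body):
                combined = set().union(*(prov[atom] for atom in body))
                if head not in prov:
                    prov[head] = combined
                    changed = True
                elif not combined <= prov[head]:
                    prov[head] |= combined
                    changed = True
    return prov.get(goal, set())


facts = {
    "a = d": {"A1"}, "b < e": {"A2"}, "c < f": {"A3"},
    "A4 (content not given in the slides)": {"A4"},    # placeholder
    "a = b ∪ c": set(), "b ∩ c = ∅": set(),            # r3, r4
    "d = e ∪ f": set(), "e ∩ f = ∅": set(),            # r7, r8
}
rules = [
    ("a = e ∪ f", ["a = d", "d = e ∪ f"]),
    ("f < c", ["a = e ∪ f", "a = b ∪ c", "b ∩ c = ∅", "e ∩ f = ∅", "b < e"]),
    ("false", ["f < c", "c < f"]),
]
print(sorted(contributing_labels(facts, rules)))   # ['A1', 'A2', 'A3']
```

Only A1, A2, and A3 reach "false", so the placeholder A4 can be dropped before any black-box search, which shrinks the lattice from 16 nodes to 8 in this toy setting.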
52. The Hybrid Approach
53. Hybrid Approach
What articulations contribute to some inconsistency?
Good old black-box (HST)
54. Benchmark Results
• White-box < Hybrid < Black-box (runtimes)
• Note: white-box does not give you a diagnosis
• Potassco < DLV
57. Summary: Hybrid Diagnosis
• ASP rules can be used to efficiently solve real-world taxonomy reasoning problems
• Reiter’s diagnosis useful to debug inconsistent alignments
• Adding a “white-box” provenance approach speeds up state-of-the-art HST algorithm by eliminating independent articulations
• Future work:
– Further improvements, including parallelism:
• Trade-off with sharing inferences across parallel instances
58. Related Projects
59. The Data Life Cycle
60. Data Quality & Curation Workflows
• Collections & occurrence data is all over the map
– … literally (off the map!)
• Issues:
– Lat/Long transposition, coordinate & projection issues
– Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)
• Filtered-Push Collaboration
62. Filtered-Push: Kurator (Data Curation Workflows)
Tianhong Song, Sven Köhler, Lei Dou (former member)
63. From Tool Users to Tool Makers
Screen capture…
back to the original definition
64. Theory meets Practice
65. Under the hood: Logic (ASP)
66. Summary & Invitation
• Building open source tools for
– Euler: Reasoning about taxonomies (& data integration)
– Kurator: Data Curation workflows
• … and other scientific workflows
• Topic not covered:
– (Game) Theory of Provenance (DAIS talk @CS, 10/7/2014)
• Looking for:
– new collaborators, students, ..
• Let’s meet!
– ludaesch@illinois.edu