LogMap: Large-scale, Logic-based and Interactive Ontology Matching
1. Preliminaries Challenges LogMap Max Recall Max Precision Evaluation
LogMap
Large-scale, Logic-based and Interactive
Ontology Matching
Ernesto Jiménez-Ruiz Bernardo Cuenca Grau
Yujiao Zhou Ian Horrocks
Department of Computer Science, University of Oxford
European Conference on Artificial Intelligence (ECAI)
29 August 2012
Outline
Preliminaries
Challenges
LogMap Anatomy
Maximising recall
Maximising precision
Evaluation
Ontologies and OWL (I)
Ontologies
• Formal representation of the knowledge of a domain.
OWL 2 Language
• The Web Ontology Language (OWL) is a World Wide Web
Consortium (W3C) standard.
• OWL 2 corresponds to a decidable fragment of first-order
logic.
Ontologies and OWL (II)
OWL 2 example axioms
• JuvenileArthritis ⊑ JuvenileDisease
• PolyArthritis ≡ Arthritis ⊓ > 5 affects.Joint
• Disease ⊓ Joint ⊑ ⊥
• JuvenileIdiopathicArthritis @ “Juvenile Rheumatoid Arthritis”
Ontology mappings
Mappings are tuples ⟨e1, e2, n, ρ⟩
• e1 and e2 are entities in O1 and O2, respectively
• n is a confidence value between 0 and 1
• ρ is the semantic relationship between e1 and e2
Formalized as OWL 2 axioms
• Where the semantic relationship ρ is one of {≡, ⊑, ⊒, ⊥}
• No extra semantics
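The mapping tuple above can be sketched as a small data structure. This is only an illustration: the class names Relation and Mapping, the method as_axiom and the confidence value 0.9 are assumptions made for the example, not part of LogMap.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """Semantic relationship rho between the two mapped entities."""
    EQUIVALENT = "≡"
    SUBSUMED_BY = "⊑"
    SUBSUMES = "⊒"
    DISJOINT = "⊥"


@dataclass(frozen=True)
class Mapping:
    entity1: str        # e1, an entity of O1
    entity2: str        # e2, an entity of O2
    confidence: float   # n, must lie in [0, 1]
    relation: Relation  # rho

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def as_axiom(self) -> str:
        """Render the mapping as a DL-style axiom string."""
        if self.relation is Relation.DISJOINT:
            return f"{self.entity1} ⊓ {self.entity2} ⊑ ⊥"
        return f"{self.entity1} {self.relation.value} {self.entity2}"


# Illustrative mapping; the confidence value is made up for the example.
m = Mapping("FMA:Trapezoid", "NCI:TrapezoidBone", 0.9, Relation.EQUIVALENT)
print(m.as_axiom())
```

Rendering a mapping as an axiom string reflects the "no extra semantics" point: a mapping is treated as an ordinary axiom over the two ontologies.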
Challenges
Why ontology matching tools?
• Ontologies are developed by different groups that use
different classifications and naming schemes.
• (Biomedical) ontologies may contain tens of thousands of
entities.
• FMA (78,989 classes), NCI (66,724 classes) and SNOMED
CT (306,591 classes) are prominent examples.
Challenges
Challenges to be addressed
• Sufficient scalability to deal with large ontologies
• Detect and repair errors.
• Reasoning with OU := O1 ∪ O2 ∪ M may lead to (a large
number of) unsatisfiable classes (i.e., OU ⊨ A ⊑ ⊥)
• Reasoning and repairing OU aggravates the scalability problem
• Logic-based but scalable techniques
• Involve the expert user (if accurate mappings are needed)
• Minimise number of requests
• Reduce delay between requests
Our approach in a nutshell
LogMap . . .
• can efficiently match semantically rich ontologies containing
tens (and even hundreds) of thousands of classes;
• incorporates sophisticated reasoning and repair capabilities;
• provides support for user intervention during the matching
process.
LogMap Anatomy
LogMap can be divided into . . .
• Stage 1: maximising recall.
• The goal is to reduce the search space
• and extract an overestimation of the mappings
• Stage 2: maximising precision.
• The goal is to return a set of (precise) mappings
• not leading to many logical inconsistencies.
Lexical indexation and mapping computation
Inverted Files
• The ontology lexicon is indexed in an inverted file (IF)
• Each entry in the IF is a “set” of words corresponding to
exact or partial entity labels
Inverted index for FMA
Index entry                   Class ids
{ acinus }                    6953, 7661, 8171
{ hepatic, acinus }           8171
{ acinus, mixed }             6953
{ serious, acinus }           7661
{ common, branch, artery }    1170, 7842
Ids for FMA class URIs
Class id    Class URI (namespace omitted)
6953        Mixed acinus
7661        Serious acinus
8171        Hepatic acinus
1170        Branch of common cochlear artery
7842        Branch of common interosseous artery
Inverted index for NCI
Index entry                   Class ids
{ acinus }                    18081
{ liver }                     18081
{ acinus, liver }             18081
{ common, branch, artery }    1204, 8087, 27727
Ids for NCI class URIs
Class id    Class URI (namespace omitted)
18081       Liver acinus
8087        Common iliac artery branch
27727       Common femoral artery branch
1204        Common carotid artery branch
Lexical indexation and mapping computation
Intersection of inverted files
• As a result we obtain an overestimation of the candidate
mappings (M?).
• This step considerably reduces the search space (e.g.
19,151 candidates for FMA-NCI with Recall=0.93 and Precision=0.14)
• Note that most of them will turn out to be incorrect (e.g.
Serious acinus ≡ Liver acinus)
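The lexical indexation and intersection steps can be sketched as follows. This is a minimal sketch under stated assumptions, not LogMap's actual indexing code: the function names, the one-word stop list and the rule of indexing every subset of up to three label words are all choices made for the example.

```python
from collections import defaultdict
from itertools import combinations

STOP_WORDS = {"of"}  # assumption: a minimal stop-word list for this example


def build_inverted_index(labels, max_words=3):
    """Map each set of label words (the whole label or part of it) to class ids."""
    index = defaultdict(set)
    for class_id, label in labels.items():
        words = sorted(set(label.lower().split()) - STOP_WORDS)
        for size in range(1, min(len(words), max_words) + 1):
            for subset in combinations(words, size):
                index[frozenset(subset)].add(class_id)
    return index


def candidate_mappings(index1, index2):
    """Pair every two classes that share at least one index entry."""
    candidates = set()
    for entry in index1.keys() & index2.keys():
        candidates.update((a, b) for a in index1[entry] for b in index2[entry])
    return candidates


# Class ids and labels taken from the FMA and NCI tables on the slides.
fma = {
    6953: "Mixed acinus",
    7661: "Serious acinus",
    8171: "Hepatic acinus",
    1170: "Branch of common cochlear artery",
    7842: "Branch of common interosseous artery",
}
nci = {
    18081: "Liver acinus",
    8087: "Common iliac artery branch",
    27727: "Common femoral artery branch",
    1204: "Common carotid artery branch",
}

fma_index = build_inverted_index(fma)
nci_index = build_inverted_index(nci)
candidates = candidate_mappings(fma_index, nci_index)
print(sorted(candidates))
```

On this toy input the pair (7661, 18081), i.e. Serious acinus and Liver acinus, is among the candidates, which illustrates why the intersection is only an overestimation with low precision.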
Overlapping estimation
• LogMap extracts two fragments (O′1 and O′2) representing
the overlap between the input ontologies (via M?)
• Logic-based modularization techniques are used
• Characteristics:
• Correct mappings are unlikely to involve classes outside these
fragments
• The use of fragments is key for the scalability challenge
Identifying reliable mappings
We select Mr ⊆ M?
Based on. . .
• High lexical similarity (using the string matcher ISUB)
• A principle of locality
• Correct mappings (C1 ≡ C2) are likely to have similar scopes
(classes semantically related)
• E.g. 2,281 (out of 19,151) reliable mappings for FMA-NCI
(P=0.91)
Identifying reliable mappings
FMA:Trapezoid ≡ NCI:Trapezoid (non reliable) vs
FMA:Trapezoid ≡ NCI:TrapezoidBone (reliable)
Reasoning with the reliable mappings (Mr)
Detecting and repairing unsatisfiabilities
• The mappings in Mr are typically very precise but may lead to many
unsatisfiabilities (> 600 for FMA-NCI)
• LogMap implements efficient methods to repair most of them
(only misses two cases for FMA-NCI)
Reasoning with the reliable mappings (Mr)
Propositional Horn SAT with Dowling-Gallier (D-G)
• LogMap implements the D-G satisfiability algorithm for
propositional Horn theories
• D-G is called for every class C on the propositional theory
PC, which consists of:
• the rule (true → C);
• the propositional representations P′1 and P′2 of the input
ontologies; and
• the propositional representation PM of the mappings.
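The satisfiability check described above can be illustrated with a counter-based forward-chaining procedure in the spirit of Dowling-Gallier. This is a generic sketch, not LogMap's code: the function name horn_satisfiable, the rule encoding as (body, head) pairs and the toy classes A1, B1, C1, A2, C2 are all assumptions made for the example.

```python
from collections import defaultdict, deque

FALSE = "⊥"  # head used for rules of the form  body -> false


def horn_satisfiable(rules, facts):
    """Forward chaining over propositional Horn rules given as (body, head).

    Each rule keeps a counter of body atoms not yet derived, so every
    occurrence of an atom in a rule body is processed at most once.
    Returns True iff FALSE is not derivable from the facts.
    """
    remaining = []                # per rule: number of underived body atoms
    heads = []                    # per rule: its head atom
    watchers = defaultdict(list)  # atom -> indices of rules with it in the body
    derived = set()
    queue = deque()

    def derive(atom):
        if atom not in derived:
            derived.add(atom)
            queue.append(atom)

    for i, (body, head) in enumerate(rules):
        body = set(body)
        remaining.append(len(body))
        heads.append(head)
        for atom in body:
            watchers[atom].append(i)
        if not body:
            derive(head)
    for fact in facts:
        derive(fact)

    while queue:
        atom = queue.popleft()
        if atom == FALSE:
            return False
        for i in watchers[atom]:
            remaining[i] -= 1
            if remaining[i] == 0:
                derive(heads[i])
    return True


# Hypothetical toy input: two tiny ontologies plus two equivalence mappings.
ontology_rules = [
    (("A1",), "B1"),         # O1: A1 ⊑ B1
    (("B1", "C1"), FALSE),   # O1: B1 ⊓ C1 ⊑ ⊥
    (("A2",), "C2"),         # O2: A2 ⊑ C2
]
mapping_rules = [
    (("A1",), "A2"), (("A2",), "A1"),  # A1 ≡ A2
    (("C1",), "C2"), (("C2",), "C1"),  # C1 ≡ C2
]

# The rule (true -> C) is encoded by passing C as the only fact.
print(horn_satisfiable(ontology_rules + mapping_rules, ["A1"]))
print(horn_satisfiable(ontology_rules, ["A1"]))
```

In this toy input A1 is satisfiable in the ontologies alone but becomes unsatisfiable once the two mappings are added, which is the situation the repair step has to handle.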
Reasoning with the reliable mappings (Mr )
Satisfiability of Smegma
Reasoning with the reliable mappings (Mr)
Mapping repair
• LogMap extends D-G to record the conflicting mappings involved
in an unsatisfiability (e.g. {m4, m5, m6, m7}).
• LogMap implements a ‘greedy’ repair algorithm to compute
repairs for each unsatisfiability
• LogMap finds all repairs of “smallest” size.
• E.g.: R1 = {m4} and R2 = {m6}
• The repair with the lowest confidence is selected.
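The idea of computing smallest repairs and picking the least confident one can be sketched as below. The sketch swaps in a brute-force search over subsets of the conflicting mappings, which is not LogMap's greedy algorithm; the function names, the toy rules, the confidence values and the use of summed confidence as the selection criterion are all assumptions made for the example.

```python
from itertools import combinations

FALSE = "⊥"


def derives_false(rules, start):
    """Naive fixpoint: is FALSE derivable from `start` using Horn rules (body, head)?"""
    derived = {start}
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in derived and set(body) <= derived:
                derived.add(head)
                changed = True
    return FALSE in derived


def smallest_repairs(ontology_rules, mappings, unsat_class):
    """All minimum-size sets of mappings whose removal makes unsat_class satisfiable.

    `mappings` maps a mapping name to (confidence, list of Horn rules).
    """
    names = sorted(mappings)
    for size in range(len(names) + 1):
        repairs = []
        for removed in combinations(names, size):
            kept = [r for n in names if n not in removed for r in mappings[n][1]]
            if not derives_false(ontology_rules + kept, unsat_class):
                repairs.append(set(removed))
        if repairs:
            return repairs
    return []


def pick_repair(repairs, mappings):
    """Choose the repair whose mappings have the lowest total confidence."""
    return min(repairs, key=lambda r: sum(mappings[m][0] for m in r))


def equiv(a, b):
    """Horn rules for the equivalence mapping a ≡ b."""
    return [((a,), b), ((b,), a)]


# Hypothetical ontologies and conflicting mappings m4..m7.
ontology_rules = [
    (("C1",), "D1"), (("E1",), "H1"), (("G1",), "H1"),      # O1
    (("C2",), "E2"), (("C2",), "G2"), (("C2", "H2"), FALSE),  # O2
]
mappings = {
    "m4": (0.9, equiv("C1", "C2")),
    "m5": (0.8, equiv("E2", "E1")),
    "m6": (0.6, equiv("H1", "H2")),
    "m7": (0.7, equiv("G2", "G1")),
}

repairs = smallest_repairs(ontology_rules, mappings, "C1")
print(repairs)
print(pick_repair(repairs, mappings))
```

In this toy input every derivation of ⊥ from C1 goes through both m4 and m6, while m5 and m7 lie on alternative paths, so the smallest repairs are {m4} and {m6} and the less confident m6 is removed.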
Reasoning with the reliable mappings (Mr )
Our class satisfiability algorithm is . . .
• sound
• If LogMap finds a class unsatisfiable, it is indeed unsatisfiable.
• worst-case linear in the size of the (classified) ontologies.
• incomplete, but incompleteness is mitigated:
• Most of the relevant non-propositional reasoning is already
performed when classifying input ontologies independently
• Mappings are Horn propositional axioms
• Most new entailments caused by the mappings are likely to be
computable using Horn propositional reasoning only (only 2
cases missed out of more than 600 for FMA-NCI)
Assessing M? \ Mr
Semantic Index
• P′1, P′2 and the repaired Mr are efficiently indexed using an
interval labelling schema.
• LogMap efficiently discards mappings in M? \ Mr that are
in conflict with the semantic index.
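An interval labelling schema can be illustrated on a small class tree. This is a simplified sketch, not LogMap's index: it assumes a tree-shaped hierarchy with one interval per class (a general hierarchy may need several), uses recursion that would be too deep for very large ontologies, and the function names and the class KneeJoint are made up for the example.

```python
def label_intervals(children, root):
    """Assign each class a (pre, post) interval by depth-first traversal of a tree."""
    intervals = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        pre = counter
        for child in children.get(node, []):
            visit(child)
        # post is the largest preorder number found in the subtree of node
        intervals[node] = (pre, counter)

    visit(root)
    return intervals


def is_subclass(intervals, sub, sup):
    """sub ⊑ sup iff the interval of sub lies inside the interval of sup."""
    lo_sub, hi_sub = intervals[sub]
    lo_sup, hi_sup = intervals[sup]
    return lo_sup <= lo_sub and hi_sub <= hi_sup


def are_disjoint(intervals, disjoint_pairs, a, b):
    """a and b are disjoint if they fall under a pair of classes declared disjoint."""
    return any(
        (is_subclass(intervals, a, x) and is_subclass(intervals, b, y))
        or (is_subclass(intervals, a, y) and is_subclass(intervals, b, x))
        for x, y in disjoint_pairs
    )


# Hypothetical hierarchy; KneeJoint is an invented class for illustration.
children = {
    "Thing": ["Disease", "Joint"],
    "Disease": ["Arthritis"],
    "Arthritis": ["JuvenileArthritis"],
    "Joint": ["KneeJoint"],
}
intervals = label_intervals(children, "Thing")
disjoint_pairs = [("Disease", "Joint")]  # Disease ⊓ Joint ⊑ ⊥

print(is_subclass(intervals, "JuvenileArthritis", "Disease"))
print(are_disjoint(intervals, disjoint_pairs, "JuvenileArthritis", "KneeJoint"))
```

With such an index, a candidate mapping stating that JuvenileArthritis and KneeJoint are equivalent could be rejected by two interval comparisons, without calling a reasoner.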
Assessing M? \ Mr
Revision of confidence values
• Co-occurrence analysis
• NCI:Hepatic acinus ⊑ FMA:Liver Acinus?
• Principle of locality
User feedback
• Clear-cut mappings in M? \ Mr are either discarded or
included in the output.
• The rest are (optionally) given to the expert user.
Assessing M? \ Mr: user feedback
Feedback requests
• Many candidate mappings are discarded automatically in the
previous steps (more than 16,000 for FMA-NCI).
• The number of non-clear-cut mappings may still be high (852 for
FMA-NCI)
User interaction in LogMap
• LogMap performs automatic actions based on user decisions
to reduce the number of remaining requests. Criteria:
• Ambiguity
• Conflicts with semantic index
• The delay in computing these automatic actions is negligible
Final Diagnosis
Horn propositional reasoning
• We perform a final repair step before returning the output
mappings (M).
OWL 2 reasoning (optional)
• Additionally, we (optionally) check how clean M is using an
off-the-shelf OWL 2 reasoner
Evaluation
Ontology Alignment Evaluation Initiative (OAEI)
http://oaei.ontologymatching.org/
• The OAEI is an annual international campaign for the
systematic evaluation of ontology matching systems
• LogMap was one of the top tools in the 2011 and 2011.5
campaigns, and
• is currently the only matching system that scales to large
ontologies and performs reasoning over their integration
Evaluation
Matching FMA, NCI and SNOMED with LogMap
• OAEI Large BioMed track:
http://www.cs.ox.ac.uk/isg/projects/SEALS/oaei/
Ontologies   |MGS|    Upper bound M?          Reliable Mr            Output M
                      |M?|     P     R        |Mr|    P     R        |M|     P     R     ⊥
FMA-NCI      2,898    19,151   0.14  0.93     2,256   0.91  0.71     2,658   0.87  0.80  2
FMA-SNMD     8,111    67,592   0.09  0.74     4,929   0.84  0.51     6,313   0.80  0.62  0
SNMD-NCI     18,322   102,514  0.13  0.75     10,598  0.86  0.50     12,978  0.81  0.58  *
(MGS: gold standard mappings; P: precision; R: recall; ⊥: unsatisfiable classes)
• GOMMA can also (successfully) cope with FMA-NCI
• P=0.85, R=0.78, F=0.81
• GOMMA mappings lead to > 5,000 unsatisfiable classes
• GOMMA matches FMA-NCI in 48 min, while LogMap takes 4 min
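The P, R and F figures reported here follow the standard definitions of precision, recall and F-measure against a gold standard. A minimal sketch, with hypothetical function names and a made-up toy mapping set, is:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(mappings, gold_standard):
    """Score a set of mappings against a gold standard set."""
    correct = len(mappings & gold_standard)
    precision = correct / len(mappings) if mappings else 0.0
    recall = correct / len(gold_standard) if gold_standard else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy data invented for the example: 2 of 3 computed mappings are correct,
# and 2 of the 4 gold standard mappings are found.
gold = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
found = {("a", "x"), ("b", "y"), ("e", "q")}
print(precision_recall_f(found, gold))

# Reported GOMMA values for FMA-NCI: P=0.85, R=0.78
print(round(f_measure(0.85, 0.78), 2))
```

Applied to the reported GOMMA values, the formula gives 0.81 after rounding, in line with the F value on the slide.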
Evaluation
User interaction in LogMap
• We have “simulated” the human expert using Gold Standard
mappings to return the correct answer with a given probability.
• We have matched more than 1,100 medium-sized modules of
NCI to (the whole of) FMA
• The number of feedback requests is manageable
• LogMap without interaction behaves similarly to LogMap
with interaction and a 30% error rate.
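A simulated expert of this kind can be sketched as an oracle that consults the gold standard and flips its answer with a given error probability. The function name, its parameters and the toy gold standard are assumptions made for the example; the actual experimental setup may differ.

```python
import random


def simulated_expert(mapping, gold_standard, error_rate, rng):
    """Say whether `mapping` is correct, giving the wrong answer with probability error_rate."""
    truth = mapping in gold_standard
    return truth if rng.random() >= error_rate else not truth


# Toy gold standard; the pair is taken from the reliable-mapping example.
gold_standard = {("FMA:Trapezoid", "NCI:TrapezoidBone")}
question = ("FMA:Trapezoid", "NCI:TrapezoidBone")

rng = random.Random(42)  # fixed seed so the simulation is repeatable
answers = [simulated_expert(question, gold_standard, 0.3, rng) for _ in range(1000)]
print(answers.count(False) / len(answers))  # observed error rate, close to 0.3
```

Passing the random generator explicitly keeps each simulated session reproducible, which makes runs with different error rates comparable.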
Conclusions and future work
• We aim at creating a suitable interface for user interaction
• Instance and property matching are already included in
LogMap but still under development.
• We also intend to implement multilingual features
• LogMap is available for download:
http://www.cs.ox.ac.uk/isg/tools/LogMap/
• It also has a Web interface:
http://csu6325.cs.ox.ac.uk/
Questions?
We want you to. . .
• . . . test LogMap and give us feedback
• . . . provide us with your ontologies and use cases
Thank you for your attention
• LogMap Project:
http://www.cs.ox.ac.uk/isg/projects/LogMap/
• Web interface:
http://csu6325.cs.ox.ac.uk/
• ernesto.jimenez.ruiz@gmail.com