http://aligned-project.eu COLD@ISWC, 18th October 2016, Kobe
Towards maintainable constraint validation
and repair for taxonomies
- The PoolParty approach
Monika Solanki
https://w3id.org/people/msolanki
@nimonika
University of Oxford
Joint work with
Christian Mader
Fraunhofer IAIS, Germany
PoolParty (SWC) Use case
PoolParty (PPT): a leading commercial taxonomy
management application and authoring tool for knowledge
graphs; provides taxonomy import functionality to
interact with third-party datasets.
Taxonomists using PPT integrate a variety of models,
schemata, ontologies and vocabularies into their
knowledge bases.
Challenge: combining varied data sources while ensuring that
the resulting data mashups conform at all times to a set of quality
heuristics, as expected by the data processing algorithms.
monika.solanki@cs.ox.ac.uk, @nimonika Constraint validation and repair for taxonomies
Motivation
Consuming and interlinking enterprise data and openly
available data within an industry setting.
Ensuring that the interlinked datasets conform to a set of
quality heuristics.
Interactively detecting and repairing datasets with
constraint violations.
Ensuring Data Consistency
Current: checks ensuring that the data persisted in the triple
store does not violate data consistency are scattered across the
code and sometimes performed multiple times.
Requirements
Provide a mechanism to specify data constraints in a
formal way,
Identify and analyse datasets that are imported into PPT
and are a source of constraint violations.
Constraint resolution
Current: checks ensuring that the data persisted in the triple
store does not violate data consistency are scattered across the
code and sometimes performed multiple times.
Requirements
Provide a validation mechanism to check for constraint
violations and evaluate it against the selected datasets.
Combine formal data constraint definitions with reusable
repair strategies that can be easily applied by end-users in
a (semi-) automatic way.
Dataset selection
SWC-generated: Datasets for which a conversion to a
PPT-compatible taxonomy has been performed by SWC
(containing 10 datasets),
Custom-generated: Datasets for which a conversion to a
PPT-compatible taxonomy has been performed by
third-party institutions (containing 9 datasets), and
Web: Datasets that use SKOS, but for which it is
currently unknown whether they are compatible with PPT
(containing 7 datasets).
Constraint specification
ConceptTypeAssertion (cta):
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?resource WHERE {
  ?resource skos:broader|skos:narrower ?otherRes .
  FILTER NOT EXISTS { ?resource a skos:Concept }
}
HierarchicalConsistency (hc):
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?resource WHERE {
  ?resource a skos:Concept .
  FILTER NOT EXISTS {
    ?resource (skos:broader|^skos:narrower)*/skos:topConceptOf ?parent .
    ?parent a skos:ConceptScheme .
  }
}
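To make the intent of the two queries above concrete, here is a minimal Python sketch that evaluates the same checks over an in-memory set of triples. It is not the PoolParty implementation: the tuple representation, the function names `cta_violations` and `hc_violations`, and the `ex:` sample data are all assumptions made for illustration.

```python
# Illustrative only: triples are plain (subject, predicate, object) string tuples.
BROADER, NARROWER = "skos:broader", "skos:narrower"
TOP_OF, TYPE = "skos:topConceptOf", "rdf:type"
CONCEPT, SCHEME = "skos:Concept", "skos:ConceptScheme"


def cta_violations(triples):
    """Subjects of skos:broader/skos:narrower that are not typed skos:Concept."""
    concepts = {s for s, p, o in triples if p == TYPE and o == CONCEPT}
    return {s for s, p, o in triples
            if p in (BROADER, NARROWER) and s not in concepts}


def hc_violations(triples):
    """Concepts with no upward path to a top concept of a concept scheme."""
    concepts = {s for s, p, o in triples if p == TYPE and o == CONCEPT}
    schemes = {s for s, p, o in triples if p == TYPE and o == SCHEME}
    # One upward step: s skos:broader o, or o skos:narrower s (inverse path).
    parents = {}
    for s, p, o in triples:
        if p == BROADER:
            parents.setdefault(s, set()).add(o)
        elif p == NARROWER:
            parents.setdefault(o, set()).add(s)
    tops = {s for s, p, o in triples if p == TOP_OF and o in schemes}

    def reaches_top(start):
        # Depth-first walk over zero or more upward steps.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in tops:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return False

    return {c for c in concepts if not reaches_top(c)}


# Hypothetical sample data.
sample = {
    ("ex:animals", TYPE, CONCEPT),
    ("ex:cats", TYPE, CONCEPT),
    ("ex:orphan", TYPE, CONCEPT),
    ("ex:scheme", TYPE, SCHEME),
    ("ex:animals", TOP_OF, "ex:scheme"),
    ("ex:cats", BROADER, "ex:animals"),
    ("ex:untyped", NARROWER, "ex:orphan"),
}
print(sorted(cta_violations(sample)))  # ['ex:untyped']
print(sorted(hc_violations(sample)))   # ['ex:orphan']
```

In the sample, `ex:untyped` takes part in a hierarchical relation without being declared a concept, and `ex:orphan` has no path to a top concept, so each triggers one of the two constraints.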
Validation using SHACL
HierarchicalConsistency (hc):
ppts:ConceptShape
  a sh:Shape ;
  sh:scopeClass skos:Concept ;
  sh:property [
    a sh:PropertyConstraint ;
    sh:predicate skos:prefLabel ;
    sh:minCount 1 ;
    sh:minLength 1 ;
    sh:datatype rdf:langString ;
    sh:uniqueLang true ] ;
  sh:constraint [
    a sh:Constraint ;
    a sh:OrConstraint ;
    sh:shapes ( ppts:ConceptHasBroaderShape
                ppts:ConceptIsTopConceptShape ) ] .
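The property constraint on skos:prefLabel in the shape above can be read as four separate checks per concept. The following Python sketch spells them out under simplifying assumptions: labels are given as hypothetical `(text, lang)` pairs, `lang` is `None` for a literal without a language tag, and the function name `pref_label_violations` is invented here. The sh:OrConstraint part is not covered, since the two referenced shapes are not shown on the slide.

```python
from collections import Counter


def pref_label_violations(labels):
    """Check one concept's skos:prefLabel values.

    labels: list of (text, lang) pairs; lang is None for untagged literals.
    Returns the names of the violated constraint components.
    """
    problems = []
    if len(labels) < 1:
        problems.append("minCount")      # at least one prefLabel
    if any(len(text) < 1 for text, _ in labels):
        problems.append("minLength")     # no empty label text
    if any(lang is None for _, lang in labels):
        problems.append("datatype")      # rdf:langString needs a language tag
    per_lang = Counter(lang for _, lang in labels if lang is not None)
    if any(count > 1 for count in per_lang.values()):
        problems.append("uniqueLang")    # at most one label per language
    return problems


print(pref_label_violations([("Cat", "en"), ("Katze", "de")]))   # []
print(pref_label_violations([("Cat", "en"), ("Feline", "en")]))  # ['uniqueLang']
```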
Repair strategies
AddInverseStrategy
ppts:ConceptHavingBroader
  a sh:Shape ;
  sh:scope [
    a sh:Scope ;
    a sh:PropertyScope ;
    sh:predicate skos:broader ] ;
  sh:inverseProperty [
    a sh:InversePropertyConstraint ;
    sh:predicate skos:narrower ;
    sh:minCount 1 ;
    rs:strategy [
      a rs:AddInverseStrategy ] ] .
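One plausible reading of AddInverseStrategy is sketched below in Python: for every skos:broader statement without a matching skos:narrower statement in the opposite direction, the missing triple is proposed as an addition. This is an interpretation, not the rs:AddInverseStrategy implementation; the tuple representation, the function names, and the `ex:` data are assumptions. Note also that the shape itself only demands at least one incoming skos:narrower, whereas this sketch completes every missing inverse.

```python
BROADER, NARROWER = "skos:broader", "skos:narrower"


def add_inverse_changeset(triples):
    """Return the skos:narrower triples needed so each skos:broader has an inverse."""
    existing = set(triples)
    return {(o, NARROWER, s) for s, p, o in existing
            if p == BROADER and (o, NARROWER, s) not in existing}


def apply_changeset(triples, additions):
    """Return a new triple set with the proposed additions applied."""
    return set(triples) | additions


# Hypothetical sample data: the inverse for ex:dogs exists, the one for ex:cats does not.
data = {
    ("ex:cats", BROADER, "ex:animals"),
    ("ex:dogs", BROADER, "ex:animals"),
    ("ex:animals", NARROWER, "ex:dogs"),
}
changes = add_inverse_changeset(data)
print(sorted(changes))  # [('ex:animals', 'skos:narrower', 'ex:cats')]

# After the repair, a second run proposes nothing further.
repaired = apply_changeset(data, changes)
print(add_inverse_changeset(repaired))  # set()
```

Returning the additions as a separate set, rather than mutating the data, mirrors the idea of producing a triples changeset that a user can review before it is applied.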
Implementation
SHACL implementation (TopQuadrant), Sesame, SWC
libraries ⇒ Java application
SKOS data model, Dataset file, Constraint specification ⇒
Violation report
Violation report, SKOS data model, Dataset file, Constraint
specification ⇒ Triples changeset
Not yet optimised for runtime performance
Validation results
cta was never violated in datasets converted to PPT
taxonomies.
upl is a SKOS-level constraint, better respected by
vocabulary providers.
Violations observed across all datasets.
Validation performance
Omitted 10 datasets that contained ≤ 50000 triples.
No correlation between the dataset size and time taken to
perform the validation.
Structure of the dataset makes a difference.
Repair strategy execution performance
Repair strategy applied to a special case of the constraint
br - BidirectionalRelationsHierarchical.
Only considered skos:broader and
skos:narrower. Did not consider owl:inverseOf.
Repair scales well even with larger datasets.
Summary and Conclusions
Interwoven SHACL-based data consistency specification
and validation with repair strategies.
Validation of datasets generated by PPT can be done with
reasonable performance.
Integrating repair strategies and data constraint
specification helps in building a unified, maintainable
model.
The model also plays a pivotal role in harmonizing data
and software development processes.
