Mappings Validation

Linked Data Quality assessment applied to, and integrated into, the Linked Data generation and publication workflow. Presented at the Data Quality tutorial, a satellite event at SEMANTICS2016.

Published in: Technology

  1. 1. Mappings Validation Data Quality Tutorial - SEMANTICS2016 Anastasia Dimou Anastasia.Dimou@ugent.be ● @natadimou Ghent University – iMinds
  2. 2. Linked (Open) Data: semantically annotated & interlinked data, using different vocabularies or ontologies, published in the form of RDF datasets
  3. 3. Linked (Open) Data derive from originally heterogeneous (semi-)structured data, e.g. Eurostat from TSV, DBLP from the DBLP database, DBpedia from Wikipedia, LinkedBrainz from the MusicBrainz database, ...
  4. 4. Linked Data Quality in the context of Linked Data generation and publication workflow
  5. 5. Linked Data Quality dimensions Representational dimension Intrinsic dimension Accessibility dimension Contextual dimension A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web Journal, 2016.
  6. 6. Linked Data Quality dimensions Representational dimension data modeling Intrinsic dimension Linked Data generation Accessibility dimension Linked Data publication Contextual dimension Linked Data consumption
  7. 7. Linked Data Quality dimensions Representational dimension data modeling Intrinsic dimension Linked Data generation Accessibility dimension Linked Data publishing Contextual dimension Linked Data consumption
  8. 8. Linked Data Quality - Intrinsic Dimension: determines the RDF dataset quality by assessing it for possible violations with respect to accuracy (e.g. malformed datatype literals) and consistency (e.g. disjoint classes/properties)
  9. 9. Instead of applying Quality Assessment to the already published Linked Data as part of Linked Data consumption, apply Quality Assessment to the Mappings that generate the Linked Data as part of Linked Data production
  10. 10. Linked Dataset Quality Assessment (DQA) Mappings Quality Assessment (MQA) Mapping & Dataset Quality Assessment Workflow Mappings & Quality Assessment Evaluation Results
  11. 11. Linked Dataset Quality Assessment (DQA) Mappings Quality Assessment (MQA) Mapping & Dataset Quality Assessment Workflow Mappings & Quality Assessment Evaluation Results
  12. 12. dbo:Person
  13. 13. dbo:Person xsd:date
  14. 14. dbo:Person xsd:date Linked Data Quality Assessment
  15. 15. Linked Data Quality Assessment (DQA) RDFUnit http://rdfunit.aksw.org: a test-driven data-debugging framework based on SPARQL patterns. D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web.
  16. 16. DQA with RDFUnit …WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) } …WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) }
  17. 17. 10 domain violations 10 datatype violations
  18. 18. 1,000,000 domain violations!!! 1,000,000 datatype violations!!!
  19. 19. Linked Data Quality Assessment (DQA) Similar violations occur repeatedly within a single Linked Data set
  20. 20. Linked Data Quality Assessment (DQA) Sets of triples of a dataset have repetitive patterns
  21. 21. Linked Data Quality Assessment (DQA) Sets of triples of a dataset have repetitive patterns
  22. 22. DQA: Linked Data Quality Assessment is applied by third parties to already published Linked Data sets [diagram: violations, DQA]
  23. 23. DQA: Linked Data Quality Assessment: adjustments are NOT applied at the root of the problem [diagram: violations, DQA]
  24. 24. DQA: Linked Data Quality Assessment: adjustments are overwritten if a new version of the original data is annotated and published as Linked Data [diagram: violations, DQA]
  25. 25. Instead of applying Quality Assessment to the already published Linked Data set as part of data consumption
  26. 26. Apply Quality Assessment to the Mappings that generate the Linked Data. A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.
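To make the repetition described on slides 17 to 21 concrete, here is a minimal Python sketch. It is not RDFUnit: triples are plain tuples, and the names `generate`, `datatype_violations` and the sample records are hypothetical. It applies a single faulty rule (dbo:birthDate values typed as xsd:gYear where xsd:date is expected, as in the deck's running example) to every record and then checks the generated data, in the spirit of the SPARQL pattern on slide 16.

```python
# Illustrative sketch only: triples are plain tuples, not an RDF library.
EXPECTED_DATATYPE = {"dbo:birthDate": "xsd:date"}  # datatype expected by the schema

def generate(records, datatype="xsd:gYear"):
    """Apply one (faulty) rule to every record, yielding typed-literal triples."""
    for name, birth in records:
        subject = f"http://example.com/{name}"
        yield (subject, "dbo:birthDate", (birth, datatype))

def datatype_violations(triples):
    """Dataset-level check: one violation per triple whose literal has the wrong datatype."""
    return [
        (s, p, o)
        for s, p, o in triples
        if p in EXPECTED_DATATYPE and o[1] != EXPECTED_DATATYPE[p]
    ]

records = [(f"person{i}", "1980") for i in range(10)]  # hypothetical source rows
violations = datatype_violations(generate(records))
print(len(violations))  # one violation per record, all caused by the same rule
```

The number of reported violations grows with the number of records, even though only one rule is wrong, which is the motivation for assessing the mappings instead.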
  27. 27. Linked Dataset Quality Assessment (DQA) Mappings Quality Assessment (MQA) Mapping & Dataset Quality Assessment Workflow Mappings & Quality Assessment Evaluation Results
  28. 28. Mapping languages formalize patterns into rules to generate Linked Data from some original data
  29. 29. RDF Mapping Language (RML) http://rml.io: extends the W3C-recommended R2RML; specifies the mapping rules to generate Linked Data from heterogeneous data sources; mapping rules are Linked Data sets too! A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), 2014.
  30. 30. RDF Mapping Language (RML) http://rml.io <#Mapping> rr:subjectMap [ rr:class dbo:Event ; rr:template "http://example.com/{Name}" ] ; rr:predicateObjectMap [ rr:predicate dbo:birthDate ; rr:objectMap [ rml:reference "Birth" ; rr:datatype xsd:gYear ] ] .
  31. 31. RDF Mapping Language (RML) http://rml.io
  32. 32. RDF Mapping Language (RML) http://rml.io [diagram: data, map doc, Mapping Processor]
  33. 33. DQA: Linked Data Quality Assessment [diagram: data, map doc, Mapping Processor, violations, DQA]
  34. 34. DQA: Linked Data Quality Assessment [diagram: data, map doc, Mapping Processor, violations, DQA]
  35. 35. DQA: Linked Data Quality Assessment [diagram: data, map doc, Mapping Processor, violations, DQA]
  36. 36. MQA: Mapping Quality Assessment [diagram: data, map doc, Mapping Processor, violations, MQA]
  37. 37. DQA with RDFUnit over RML …WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) } …WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) }
  38. 38. D→MQA with RDFUnit over RML …WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) } …WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) }
  39. 39. D→MQA with RDFUnit over RML …WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) } …WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) } <#Mapping> rr:subjectMap [ rr:class dbo:Event ; rr:template "http://example.com/{Name}" ] ; rr:predicateObjectMap [ rr:predicate dbo:birthDate ; rr:objectMap [ rml:reference "Birth" ; rr:datatype xsd:gYear ] ] .
  40. 40. D→MQA with RDFUnit over RML …WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) } …WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) } … WHERE { ?resource rr:predicateObjectMap ?poMap. ?poMap rr:predicate %%P1%%; rr:objectMap ?objM. ?objM rr:datatype ?c. FILTER (?c != %%D1%%) } <#Mapping> rr:subjectMap [ rr:class dbo:Event ; rr:template "http://example.com/{Name}" ] ; rr:predicateObjectMap [ rr:predicate dbo:birthDate ; rr:objectMap [ rml:reference "Birth" ; rr:datatype xsd:gYear ] ] .
  41. 41. MQA: Mapping Quality Assessment [diagram: data, map doc, Mapping Processor, violations, MQA]
  42. 42. MQA with RDFUnit over RML …WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) } …WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) } … WHERE { ?resource rr:predicateObjectMap ?poMap. ?poMap rr:predicate %%P1%%; rr:objectMap ?objM. ?objM rr:datatype ?c. FILTER (?c != %%D1%%) } <#Mapping> rr:subjectMap [ rr:class dbo:Event ; rr:template "http://example.com/{Name}" ] ; rr:predicateObjectMap [ rr:predicate dbo:birthDate ; rr:objectMap [ rml:reference "Birth" ; rr:datatype xsd:gYear ] ] . ONLY 1 domain violation!!! ONLY 1 datatype violation!!!
  43. 43. MDQA: Uniform Mapping & Dataset Quality Assessment [diagram: data, map doc, Mapping Processor, violations, MDQA]
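The mapping-level pattern from slide 40 can be mimicked in a few lines of Python. This is a sketch under assumptions, not how RDFUnit evaluates tests: the mapping from slide 30 is written out by hand as (subject, predicate, object) tuples, the blank-node labels `_:sm`, `_:pom`, `_:om` are made up, and the helper names are hypothetical. The walk follows the same path as the SPARQL pattern: rr:predicateObjectMap, then rr:predicate and rr:objectMap, then rr:datatype.

```python
# Illustrative sketch only: the mapping from slide 30 as plain (s, p, o) tuples.
MAPPING = {
    ("#Mapping", "rr:subjectMap", "_:sm"),
    ("_:sm", "rr:class", "dbo:Event"),
    ("_:sm", "rr:template", "http://example.com/{Name}"),
    ("#Mapping", "rr:predicateObjectMap", "_:pom"),
    ("_:pom", "rr:predicate", "dbo:birthDate"),
    ("_:pom", "rr:objectMap", "_:om"),
    ("_:om", "rml:reference", "Birth"),
    ("_:om", "rr:datatype", "xsd:gYear"),
}

def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def mapping_datatype_violations(triples, prop, expected):
    """Find object maps that declare a datatype other than the expected one for prop."""
    found = []
    for _, p, po_map in triples:
        if p != "rr:predicateObjectMap":
            continue
        if prop not in objects(triples, po_map, "rr:predicate"):
            continue
        for obj_map in objects(triples, po_map, "rr:objectMap"):
            for datatype in objects(triples, obj_map, "rr:datatype"):
                if datatype != expected:
                    found.append((obj_map, datatype))
    return found

violations = mapping_datatype_violations(MAPPING, "dbo:birthDate", "xsd:date")
print(violations)
```

However many records the mapping is later applied to, the check reports the faulty rule once, at the object map that declares the wrong datatype.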
  44. 44. Linked Dataset Quality Assessment (DQA) Mappings Quality Assessment (MQA) Mapping & Dataset Quality Assessment Workflow Mappings & Quality Assessment Evaluation Results
  45. 45. MQA: Mapping Quality Assessment discover not only the violations but also their origin before they are even generated
  46. 46. MQA: Mapping Quality Assessment: easily apply structural adjustments; prevent the same violations from appearing repeatedly over distinct entities; allow intuitively combining different ontologies and vocabularies
  47. 47. MDQA: Uniform Mapping & Dataset Quality Assessment [diagram: data, map doc, Mapping Processor, violations, MDQA]
  48. 48. MDQA: Uniform Mapping & Dataset Quality Assessment [diagram: data, map doc, Mapping Processor, violations, MDQA] <#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:gYear ; rut:missingValue xsd:date .
  49. 49. Uniform Mapping & Dataset Quality Assessment Workflow [diagram: data, map doc, Mapping Processor, Mapping Refinements, violations, MDQA]
  50. 50. Correcting MQA violations with RML Editor
  51. 51. Correcting MQA violations with RML Editor
  52. 52. Correcting MQA violations with RML Editor
  53. 53. MDQA: Uniform Mapping & Dataset Quality Assessment [diagram: data, map doc, Mapping Processor, violations, MDQA] <#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:gYear ; rut:missingValue xsd:date . DEL: <#ObjectMap> rr:datatype xsd:gYear. ADD: <#ObjectMap> rr:datatype xsd:date.
  54. 54. MQA with RDFUnit over RML <#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:float ; rut:missingValue xsd:int . DEL: <#ObjectMap> rr:datatype xsd:gYear. ADD: <#ObjectMap> rr:datatype xsd:date. DEL: <#SubjectMap> rr:class dbo:Event. ADD: <#SubjectMap> rr:class dbo:Person.
  55. 55. MQA with RDFUnit over RML <#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:float ; rut:missingValue xsd:int . DEL: <#ObjectMap> rr:datatype xsd:gYear. ADD: <#ObjectMap> rr:datatype xsd:date. <#Mapping> rr:subjectMap [ rr:class dbo:Person ; rr:template "http://example.com/{Name}" ] ; rr:predicateObjectMap [ rr:predicate dbo:birthDate ; rr:objectMap [ rml:reference "Birth" ; rr:datatype xsd:date ] ] . DEL: <#SubjectMap> rr:class dbo:Event. ADD: <#SubjectMap> rr:class dbo:Person.
  56. 56. Uniform Mapping & Dataset Quality Assessment Workflow [diagram: data, new map doc, map doc, Mapping Processor, Mapping Refinements, violations, MDQA (optional)]
  57. 57. Uniform Mapping & Dataset Quality Assessment Workflow [diagram: data, new map doc, map doc, Mapping Processor, Mapping Refinements, violations, MDQA (optional)]
  58. 58. Uniform Mapping & Dataset Quality Assessment Workflow
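The DEL/ADD refinement step from slide 53 can be sketched as a set operation on the mapping's triples. This is a minimal illustration under assumptions: the mapping fragment is hand-written as tuples, the violation report is flattened into a hypothetical Python dict whose keys mirror the property names on the slide, and `refinement` and `apply_refinement` are made-up helper names, not part of RDFUnit or the RML tooling.

```python
# Illustrative sketch only: a mapping fragment as a set of (s, p, o) tuples.
mapping = {
    ("#SubjectMap", "rr:class", "dbo:Event"),
    ("#ObjectMap", "rr:datatype", "xsd:gYear"),
    ("#ObjectMap", "rml:reference", "Birth"),
}

# Hypothetical dict form of the violation report shown on slide 53.
result = {
    "violationRoot": "#ObjectMap",
    "violationPath": "rr:datatype",
    "violationValue": "xsd:gYear",
    "missingValue": "xsd:date",
}

def refinement(report):
    """Turn one violation report into a (DEL, ADD) pair of triples."""
    root, path = report["violationRoot"], report["violationPath"]
    return (root, path, report["violationValue"]), (root, path, report["missingValue"])

def apply_refinement(triples, report):
    """Return a new mapping with the offending triple replaced; the input is left untouched."""
    delete, add = refinement(report)
    return (triples - {delete}) | {add}

refined = apply_refinement(mapping, result)
print(sorted(refined))
```

Returning a new set rather than mutating the old one keeps both the original and the refined mapping document available, which matches the "map doc" and "new map doc" boxes in the workflow diagram.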
  59. 59. Mapping Quality Assessment: Limitations
  60. 60. Mapping Quality Assessment: Limitations: certain test cases inevitably require the complete Linked Data set
  61. 61. Mapping Quality Assessment: Limitations: certain test cases inevitably require the complete Linked Data set (cardinality, functionality, symmetricity)
  62. 62. Mapping Quality Assessment: Limitations: certain test cases inevitably require the complete Linked Data set (cardinality, functionality, symmetricity). In the mappings' defense: these are more of a data issue, NOT affected by the mapping rules
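A functionality check illustrates why some test cases need the generated data. The sketch below is hypothetical throughout: the triples are invented, `functional_violations` is a made-up helper, and dbo:birthDate is simply treated as functional for the sake of the example. The mapping rule is identical for every record, so nothing in the mapping reveals that one source entity carries two values.

```python
from collections import defaultdict

def functional_violations(triples, prop):
    """Subjects with more than one distinct value for a property treated as functional."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}

# Hypothetical generated data: the source held two birth dates for "ann".
data = [
    ("http://example.com/ann", "dbo:birthDate", "1980-01-01"),
    ("http://example.com/ann", "dbo:birthDate", "1981-01-01"),
    ("http://example.com/bob", "dbo:birthDate", "1975-05-05"),
]
print(functional_violations(data, "dbo:birthDate"))
```

The violation stems from the source data, not from the rule, so it can only surface once the dataset has been generated.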
  63. 63. Linked Dataset Quality Assessment (DQA) Mappings Quality Assessment (MQA) Mapping & Dataset Quality Assessment Workflow Mappings & Quality Assessment Evaluation Results
  64. 64. Dataset vs Mapping Quality Assessment: Number of Violations
      Dataset       Dataset Assessment   Mappings Assessment
      DBpedia EN    3.2M                 160
      DBLP          8.1M                 8
      * DBpedia and DBLP D2RQ mappings were translated to RML mappings
      A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.
  65. 65. Dataset vs Mapping Quality Assessment: Time
                    Dataset Quality Assessment   Mappings Quality Assessment
      Dataset       size      time               size      time
      DBpedia EN    62M       16h                115K      11s
      DBpedia NL    21M       1.5h               53K       6s
      DBLP          12M       12h                368       12s
      A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.
  66. 66. Mapping Quality Assessment * http://mappings.dbpedia.org/validation
      Live update of DBpedia Mapping Quality Assessment results every night! ☺
      Dataset       size      time
      DBpedia EN    115K      11s
      DBpedia NL    53K       6s
      DBpedia All   511K      32s
      A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.
  67. 67. DBpedia Mappings Quality Assessment * http://mappings.dbpedia.org/validation Live update of DBpedia Mapping Quality Assessment results every night! ☺ A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann. DBpedia Mappings Quality Assessment. To be published in Proceedings of the 15th International Semantic Web Conference: Posters and Demos, 2016.
  68. 68. Linked Dataset Quality Assessment (DQA) Mappings Quality Assessment (MQA) Mapping & Dataset Quality Assessment Workflow Mappings & Quality Assessment Evaluation Results
  69. 69. Violations are related to the dataset's schema (vocabularies or ontologies) and occur repeatedly within a single RDF dataset. The situation worsens the more ontologies and vocabularies are reused and combined
  70. 70. Linked Data Quality Assessment: shifted from data consumption to data publication; integrated systematically in the publishing workflow; violations are identified, resolved and will not re-appear; Linked Data of higher quality is generated!!!
  71. 71. Mappings Validation Data Quality Tutorial - SEMANTICS2016 Anastasia Dimou Anastasia.Dimou@ugent.be ● @natadimou Ghent University – iMinds
