Using Semantic Web Resources for Data Quality Management

  1. Using Semantic Web Resources for Data Quality Management. Christian Fürber and Martin Hepp (christian@fuerber.com, mhepp@computer.org). Presentation at the 17th International Conference on Knowledge Engineering and Knowledge Management, October 10-15, 2010, Lisbon, Portugal.
  2. Purpose of Data. [Diagram labels: Measurement, Information & Knowledge, DATA, Automation, Decisions] C. Fürber, M. Hepp: Using SemWeb Resources for DQM
  3. Data Quality in Practice. Reference: http://www.heise.de/newsticker/meldung/Comdirect-Bank-macht-Kunden-zu-Billiardaeren-996088.html
  4. The Web of Messy Data? Retrieved from http://dbpedia.org/sparql on July 20th. Which one is the correct population?
  5. The Web of Messy Data? Retrieved from http://dbpedia.org/sparql on July 20th. Places with negative population?!?
  6. Risk of Failure. [Diagram labels: Measurement, Information & Knowledge, DATA, Automation, Decisions]
  7. Data Quality Problem Types
     • Inconsistent duplicates
     • Approximate duplicates
     • Invalid characters
     • Invalid substrings
     • Character alignment violation
     • Word transpositions
     • Mistyping / misspelling errors
     • Missing classification
     • Incorrect classification
     • Incorrect reference
     • Referential integrity violation
     • Cardinality violation
     • Unique value violation
     • Functional dependency violation
     • Missing values
     • Misfielded values
     • False values
     • Out of range values
     • Imprecise values
     • Meaningless values
     • Outdated values
     • Untyped literals
     • Existence of homonyms
     • Existence of synonyms
     • Contradictory relationships
     • Outdated conceptual elements
     Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  8. Goals
     • Use Semantic Web data to identify data quality problems at the instance level
     • Support the Data Quality Management (DQM) process
  9. Total Data Quality Management for and based on the Semantic Web
     [Diagram: DQ cycle with the phases Define, Measure, Analyze, Improve]
     • Define: define what's good and/or what's poor data quality
     • Measure: develop and apply SPARQL queries based on the DQ definition
     Reference: Richard Wang (1998)
  10. How can the Semantic Web support Data Quality Management? Availability of FREE data quality knowledge, e.g. for the identification of:
      • Legal value violations
      • Functional dependency violations
  11. Using Trusted References. [Diagram: DQ constraints compare the tested knowledge base (local:Location: Las Vegas, France) with the trusted reference (tref:Location: Las Vegas, USA) and report the pair "Las Vegas, France".]
  12. Basic Architecture
  13. Basic Characteristics of SPIN (http://spinrdf.org/)
      • Allows definition of generalized SPARQL query templates
      • Constraint checking based on SPARQL
      • Definition of inferencing rules via SPARQL
  14. Generic Data Quality Constraints Library for Easy DQ Definition
      • Mandatory properties & literals
      • Legal values*
      • Legal value ranges
      • Functional dependencies*
      • Legal syntaxes
      • Uniqueness
      * Designed to use trusted references
      Available @ http://semwebquality.org/ontologies/dq-constraints#
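The constraint types that do not need a trusted reference can be sketched outside of SPARQL as well. The following Python snippet is only an illustration of what such checks test, not the SPIN library itself; the record layout, the property names, and all function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample records; ids, property names, and values are made up.
records = [
    {"id": "a", "name": "Town A", "population": 1200, "code": "A-001"},
    {"id": "b", "name": "Town B", "population": -5, "code": "A-001"},
    {"id": "c", "population": 300, "code": "bad"},
]

def missing_mandatory(records, prop):
    """Return ids of records that lack a value for a mandatory property."""
    return [r["id"] for r in records if r.get(prop) in (None, "")]

def out_of_range(records, prop, low, high):
    """Return ids of records whose numeric value lies outside [low, high]."""
    return [r["id"] for r in records
            if prop in r and not (low <= r[prop] <= high)]

def syntax_violations(records, prop, pattern):
    """Return ids of records whose value does not fully match the regex."""
    rx = re.compile(pattern)
    return [r["id"] for r in records
            if prop in r and not rx.fullmatch(str(r[prop]))]

def uniqueness_violations(records, prop):
    """Return ids of records that share a value which should be unique."""
    counts = Counter(r[prop] for r in records if prop in r)
    return [r["id"] for r in records if prop in r and counts[r[prop]] > 1]

print(missing_mandatory(records, "name"))
print(out_of_range(records, "population", 0, 10**10))
print(syntax_violations(records, "code", r"[A-Z]-\d{3}"))
print(uniqueness_violations(records, "code"))
```

A legal value range check of this kind would catch the places with negative population shown on slide 5.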
  15. Definition of Data Quality Constraints based on SPIN
  16. Constraint checking in Practice
  17. Legal Value Constraints
      Return all instances of class vcard:Address whose value for property vcard:country-name has no matching value for property tref:country in the trusted reference.

          SELECT ?s
          WHERE {
            ?s a vcard:Address .
            ?s vcard:country-name ?value .
            # Look for a trusted location with the same country value
            OPTIONAL {
              ?s2 a tref:Location .
              ?s2 tref:country ?value1 .
              FILTER(str(?value1) = str(?value))
            }
            # Keep only addresses for which no trusted match was found
            FILTER(!bound(?s2))
          }
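The logic of a legal value check against a trusted reference can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the address identifiers, and the sample values are hypothetical, and values are compared as plain strings, as with str() in the query.

```python
def legal_value_violations(tested, trusted_values):
    """Return subjects whose value has no exact match among the trusted values."""
    legal = {str(v) for v in trusted_values}
    return [subject for subject, value in tested if str(value) not in legal]

# Hypothetical tested data: (subject, country name) pairs.
tested_addresses = [
    ("addr1", "Germany"),
    ("addr2", "Portugal"),
    ("addr3", "Germny"),  # misspelled on purpose
]

# Hypothetical trusted reference: the set of legal country names.
trusted_countries = ["Germany", "Portugal", "France"]

violations = legal_value_violations(tested_addresses, trusted_countries)
print(violations)  # ['addr3']
```

Note that a correct value is also reported when the trusted reference simply does not contain it, so the result is a list of candidates for review rather than of confirmed errors.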
  18. Functional Dependency Constraints
      Return all instances of gr:LocationOfSalesOrServiceProvisioning whose vcard:ADR node has a city-country combination without a matching pair in instances of gn:Location.

          SELECT ?s
          WHERE {
            ?s a gr:LocationOfSalesOrServiceProvisioning .
            ?s vcard:ADR ?node .
            ?node vcard:city ?value1 .
            ?node vcard:country ?value2 .
            # Flag addresses whose city-country pair is absent from the trusted reference
            FILTER NOT EXISTS {
              ?s2 a gn:Location .
              ?s2 gn:asciiname ?value1 .
              ?s2 gn:country ?value2 .
            }
          }
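The same idea extends from single values to value combinations. The sketch below, again only an illustration with hypothetical names and sample data, reports every tested location whose (city, country) pair does not occur in the trusted reference, using the Las Vegas example from slide 11.

```python
def fd_violations(tested, trusted_pairs):
    """Return subjects whose (city, country) pair is absent from the trusted pairs."""
    trusted = set(trusted_pairs)
    return [subject for subject, city, country in tested
            if (city, country) not in trusted]

# Hypothetical tested data: (subject, city, country) triples.
tested_locations = [
    ("loc1", "Las Vegas", "France"),
    ("loc2", "Las Vegas", "USA"),
]

# Hypothetical trusted reference: known-good (city, country) pairs.
trusted_pairs = [("Las Vegas", "USA")]

fd_result = fd_violations(tested_locations, trusted_pairs)
print(fd_result)  # ['loc1']
```

Both values of "loc1" are legal on their own; only the combination is inconsistent, which is why this check needs pairs rather than two separate legal value checks.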
  19. Acquisition of Semantic Web Sources for DQM
      (1) Replication of relevant knowledge bases
      (2) On the fly via federated SPARQL queries:

          PREFIX dbo: <http://dbpedia.org/ontology/>
          PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
          # The default prefix ":" of the local data is not declared on the slide
          SELECT *
          WHERE {
            ?s1 :location_CITY ?city .
            OPTIONAL {
              SERVICE <http://dbpedia.org/sparql> {
                ?s2 a dbo:City .
                ?s2 rdfs:label ?city .
                FILTER (lang(?city) = "en") .
              }
            }
            FILTER(!bound(?s2))
          }
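For option (2), a check of this shape could be generated for different endpoints and properties. The helper below is a rough sketch that only composes the query text; the function name, its parameters, and the example.org property IRI are assumptions for illustration, and actually sending the query requires a SPARQL engine with federated query support, which is not shown here.

```python
def build_federated_check(endpoint: str, local_property: str, remote_class: str) -> str:
    """Compose a query listing local values that have no match at the remote endpoint."""
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?s1 ?city WHERE {{
  ?s1 {local_property} ?city .
  OPTIONAL {{
    SERVICE <{endpoint}> {{
      ?s2 a {remote_class} ;
          rdfs:label ?city .
    }}
  }}
  FILTER(!bound(?s2))
}}"""

# Hypothetical local property IRI; the slide's default prefix is not known.
query = build_federated_check(
    endpoint="http://dbpedia.org/sparql",
    local_property="<http://example.org/location_CITY>",
    remote_class="dbo:City",
)
print(query)
```

The arguments are inserted into the query text without escaping, so this sketch is only suitable for trusted, hand-written inputs.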
  20. Limitations
      • High degree of uncertainty about the quality of Semantic Web resources
      • Risk of proliferating data quality problems
      • Lack of Semantic Web resources for certain domains
      • Flexible design of RDF and structural heterogeneity complicate the definition of generic DQ constraints
      • Scalability on large data sets
      • DQ constraints close the world (they impose a closed-world assumption on the data)
  21. Contributions
      • Data quality control for Semantic Web data
      • Identification of potential inconsistencies between Semantic Web resources
      • Reduction of effort for the definition of functional dependency rules and legal value rules
      • Reuse of shared data quality rules on a Web scale
  22. Future Work
      • Semantic Web information quality assessment framework (SWIQA) with computation of KPIs
      • Analysis and identification of useful "trusted references" based on SWIQA
      • Application to multi-source master data of information systems
      • Evaluation on large data sets
  23. Data Quality Constraints Library for SPIN @ http://semwebquality.org/ontologies/dq-constraints#
      Christian Fürber, Researcher
      E-Business & Web Science Research Group
      Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany
      skype: c.fuerber
      email: christian@fuerber.com
      web: http://www.unibw.de/ebusiness
      homepage: http://www.fuerber.com
      twitter: http://www.twitter.com/cfuerber
      Paper available at http://bit.ly/c5v6TM
