Using Semantic Web Resources for Data Quality Management


Transcript

  • 1. Using Semantic Web Resources for Data Quality Management. Christian Fürber and Martin Hepp (christian@fuerber.com, mhepp@computer.org). Presentation at the 17th International Conference on Knowledge Engineering and Knowledge Management, October 10-15, 2010, Lisbon, Portugal
  • 2. Purpose of Data [Diagram: DATA (binary stream) feeding Measurement, Information & Knowledge, Automation, and Decisions] C. Fürber, M. Hepp: 2 Using SemWeb Resources for DQM
  • 3. Data Quality in Practice Reference: http://www.heise.de/newsticker/meldung/Comdirect-Bank-macht-Kunden-zu-Billiardaeren-996088.html
  • 4. The Web of Messy Data? Which one is the correct population? (Retrieved from http://dbpedia.org/sparql on July 20th)
  • 5. The Web of Messy Data? Places with negative population?!? (Retrieved from http://dbpedia.org/sparql on July 20th)
  • 6. Risk of Failure [Diagram: DATA (binary stream) feeding Measurement, Information & Knowledge, Automation, and Decisions]
  • 7. Data Quality Problem Types • Inconsistent duplicates • Invalid characters • Missing classification • Incorrect reference • Approximate duplicates • Character alignment violation • Word transpositions • Invalid substrings • Mistyping / Misspelling errors • Cardinality violation • Missing values • Referential integrity violation • Misfielded values • Unique value violation • False values • Functional dependency violation • Out of range values • Imprecise values • Existence of homonyms • Meaningless values • Incorrect classification • Existence of synonyms • Contradictory relationships • Outdated conceptual elements • Untyped literals • Outdated values Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • 8. Goals • Use Semantic Web data to identify data quality problems at the instance level • Support the Data Quality Management (DQM) process
  • 9. Total Data Quality Management for and based on the Semantic Web [Cycle diagram around DQ: Define, Measure, Analyze, Improve] • Define: define what's good and/or what's poor data quality • Measure: develop and apply SPARQL queries based on the DQ definition Reference: Richard Wang (1998)
  • 10. How can the Semantic Web support Data Quality Management? Availability of FREE Data Quality Knowledge, e.g. for the identification of… • Legal value violations • Functional dependency violations
  • 11. Using Trusted References [Diagram: DQ-Constraints compare the tested knowledge base (local:Location: Las Vegas, France) with the trusted reference (tref:Location: Las Vegas, USA) and flag "Las Vegas, France"]
  • 12. Basic Architecture
  • 13. Basic Characteristics of SPIN (http://spinrdf.org/) • Allows definition of generalized SPARQL query templates • Constraint checking based on SPARQL • Definition of inferencing rules via SPARQL
  • 14. Generic Data Quality Constraints Library for Easy DQ-Definition, available @ http://semwebquality.org/ontologies/dq-constraints# • Mandatory properties & literals • Legal values* • Legal value ranges • Functional dependencies* • Legal syntaxes • Uniqueness (* designed to use trusted references)
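The four constraint types in this list that do not need a trusted reference can be sketched in plain Python. This is only an illustration under assumptions, not the SPIN library itself: the records, property names, value range, and postal-code pattern below are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample records; all values are made up for illustration.
records = [
    {"id": "a1", "city": "Lisbon", "population": 1000, "postal_code": "1000-001"},
    {"id": "a2", "city": "Nowhere", "population": -5, "postal_code": "1000-002"},
    {"id": "a3", "city": None, "population": 1200, "postal_code": "ABC"},
    {"id": "a4", "city": "Porto", "population": 2000, "postal_code": "1000-001"},
]

def missing_mandatory(rows, prop):
    """Mandatory property check: ids of rows where prop is absent or empty."""
    return [r["id"] for r in rows if r.get(prop) in (None, "")]

def out_of_range(rows, prop, low, high):
    """Legal value range check: ids of rows whose value lies outside [low, high]."""
    return [r["id"] for r in rows
            if r.get(prop) is not None and not (low <= r[prop] <= high)]

def illegal_syntax(rows, prop, pattern):
    """Legal syntax check: ids of rows whose value does not fully match pattern."""
    regex = re.compile(pattern)
    return [r["id"] for r in rows
            if r.get(prop) is not None and not regex.fullmatch(r[prop])]

def not_unique(rows, prop):
    """Uniqueness check: ids of rows sharing a value of prop with another row."""
    counts = Counter(r[prop] for r in rows if r.get(prop) is not None)
    return [r["id"] for r in rows
            if r.get(prop) is not None and counts[r[prop]] > 1]
```

Each function returns the identifiers of the violating records, which mirrors how a constraint query returns the offending instances rather than a pass/fail flag.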
  • 15. Definition of Data Quality Constraints based on SPIN
  • 16. Constraint checking in Practice
  • 17. Legal Value Constraints Return all instances of class vcard:Address whose value for property vcard:country-name has no matching value in property tref:country of the trusted reference.
      SELECT ?s WHERE {
        ?s a vcard:Address .
        ?s vcard:country-name ?value .
        FILTER NOT EXISTS {
          ?s2 a tref:Location .
          ?s2 tref:country ?value1 .
          FILTER(str(?value1) = str(?value))
        }
      }
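The intent stated on this slide, flagging every address whose country name matches none of the trusted values, amounts to a set-membership test. A minimal Python sketch follows, assuming the local values and the trusted reference have already been loaded into plain Python structures; the subject ids and country values are hypothetical.

```python
# Hypothetical stand-ins for the two data sets:
# subject id -> value of vcard:country-name in the tested data
local_addresses = {"addr1": "USA", "addr2": "Frnace", "addr3": "France"}
# all values of tref:country found in the trusted reference
trusted_countries = {"USA", "France", "Portugal"}

def illegal_values(values_by_subject, legal_values):
    """Return the subjects whose value has no match among the legal values.

    Values are compared as strings, mirroring the str() comparison in the query.
    """
    legal = {str(v) for v in legal_values}
    return sorted(s for s, v in values_by_subject.items() if str(v) not in legal)
```

A subject is reported only when no trusted value equals its value; comparing it against trusted values one at a time and reporting on the first mismatch would flag nearly every address.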
  • 18. Functional Dependency Constraints Return all instances of vcard:ADR with city-country combinations that do not have a matching pair in instances of gn:Location.
      SELECT ?s WHERE {
        ?s a gr:LocationOfSalesOrServiceProvisioning .
        ?s vcard:ADR ?node .
        ?node vcard:city ?value1 .
        ?node vcard:country ?value2 .
        FILTER NOT EXISTS {
          ?s2 a gn:Location .
          ?s2 gn:asciiname ?value1 .
          ?s2 gn:country ?value2 .
        }
      }
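The functional dependency check can be sketched the same way: a city-country combination is a violation when that exact pair does not occur in the trusted reference. The rows below reuse the Las Vegas example from slide 11; the Paris row and the subject ids are hypothetical additions for illustration.

```python
# Hypothetical tested data: (subject, city, country)
local_locations = [
    ("loc1", "Las Vegas", "USA"),
    ("loc2", "Las Vegas", "France"),
    ("loc3", "Paris", "France"),
]
# Hypothetical trusted reference: the set of known (city, country) pairs
trusted_pairs = {("Las Vegas", "USA"), ("Paris", "France")}

def fd_violations(rows, trusted):
    """Return subjects whose (city, country) pair is absent from the trusted set."""
    return [subject for subject, city, country in rows
            if (city, country) not in trusted]
```

Checking the pair as a whole matters here: "Las Vegas" and "France" are each legal values on their own, so a per-property legal value check would not catch "Las Vegas, France".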
  • 19. Acquisition of Semantic Web Sources for DQM (1) Replication of relevant knowledge-bases (2) On the fly via federated SPARQL queries:
      PREFIX dbo: <http://dbpedia.org/ontology/>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      SELECT * WHERE {
        ?s1 :location_CITY ?city .
        OPTIONAL {
          SERVICE <http://dbpedia.org/sparql> {
            ?s2 a dbo:City .
            ?s2 rdfs:label ?city .
            FILTER (lang(?city) = "en")
          }
        }
        FILTER(!bound(?s2))
      }
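The federated query relies on the OPTIONAL plus FILTER(!bound(...)) idiom: try to join each local row with the remote data, then keep only the rows for which the join found nothing. The sketch below imitates that logic in Python without any network access. The `remote_cities` list is a hypothetical stand-in for what the remote endpoint would return, and the identifiers in it are illustrative only.

```python
# Hypothetical local rows: (subject, city label)
local_rows = [("s1", "Lisbon"), ("s2", "Lisbonn")]
# Hypothetical stand-in for the remote result: (remote subject, English label)
remote_cities = [("dbr:Lisbon", "Lisbon"), ("dbr:Porto", "Porto")]

def unmatched_cities(rows, remote):
    """Keep local rows that have no remote city with the same label."""
    results = []
    for subject, city in rows:
        # OPTIONAL: collect remote matches, possibly none
        matches = [s2 for s2, label in remote if label == city]
        # FILTER(!bound(?s2)): keep the row only if nothing matched
        if not matches:
            results.append((subject, city))
    return results
```

In this sketch the remote data is fetched once and compared locally, which corresponds more closely to option (1), replication; with option (2) the matching step runs at the remote endpoint for each query.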
  • 20. Limitations • High degree of uncertainty about quality of Semantic Web resources • Risk for data quality problem proliferation • Lack of Semantic Web resources for certain domains • Flexible design of RDF and structural heterogeneity complicate definition of generic DQ constraints • Scalability on large data sets • DQ constraints close the world
  • 21. Contributions • Data quality control for Semantic Web data • Identification of potential inconsistencies between Semantic Web Resources • Reduction of effort for the definition of functional dependency rules and legal value rules • Reuse of shared data quality rules on a Web scale
  • 22. Future Work • Semantic Web information quality assessment framework (SWIQA) with computation of KPIs • Analysis and identification of useful "trusted references" based on SWIQA • Application on multi-source master data of information systems • Evaluation on large data sets
  • 23. Data Quality Constraints Library for SPIN @ http://semwebquality.org/ontologies/dq-constraints# Christian Fürber, Researcher, E-Business & Web Science Research Group, Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany • skype: c.fuerber • email: christian@fuerber.com • web: http://www.unibw.de/ebusiness • homepage: http://www.fuerber.com • twitter: http://www.twitter.com/cfuerber Paper available at http://bit.ly/c5v6TM