Towards a Vocabulary for Data Quality Management in Semantic Web Architectures


  1. Towards a Vocabulary for DQM in Semantic Web Architectures (Research in Progress). Christian Fürber and Martin Hepp (christian@fuerber.com, mhepp@computer.org). Presentation at the 1st International Workshop on Linked Web Data Management, March 25th, 2011, Uppsala, Sweden.
  2. Part 1: What's the Problem? C. Fürber, M. Hepp: Towards a Vocabulary for DQM in SemWeb Architectures
  3. Various Data Quality Problems: Inconsistent duplicates; Invalid characters; Missing classification; Incorrect reference; Approximate duplicates; Character alignment violation; Word transpositions; Invalid substrings; Mistyping / misspelling errors; Cardinality violation; Missing values; Referential integrity violation; Misfielded values; Unique value violation; False values; Functional dependency violation; Out of range values; Imprecise values; Existence of homonyms; Meaningless values; Incorrect classification; Existence of synonyms; Contradictory relationships; Outdated conceptual elements; Untyped literals; Outdated values. Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  4. The Problem: negative population; weird population values; invalid URLs. Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
  5. Part 2: What are high quality data?
  6. What is Data Quality? • Data's "fitness for use by data consumers" (Wang, Strong 1996) • "Conformance to specification" (Kahn et al. 2002) • "Data are of high quality if they are fit for their intended uses in operations, decision making, and planning. Data are fit for use if they are free of defects and possess desired features." (Redman 2001) • Requirements as "benchmark"
  7. Perspective-Neutral Data Quality: Data quality is the degree to which data fulfill quality requirements, no matter who makes the quality requirements.
  8. Quality Requirements and The Problem. Requirement: population cannot be negative (problem: negative population). Requirement: population is indicated by numeric values (problem: weird population values). Requirement: URLs usually start with http://, https://, etc. (problem: invalid URLs). Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
  9. Satisfying Quality Requirements (diagram). Desired states are defined by individuals, groups, standards, etc.; the goal is status quo = desired state. Problem 1: Expressing Quality Requirements. Problem 2: Harmonizing Requirements. Problem 3: Satisfying Requirements.
  10. Part 3: Research Goal
  11. Major Research Goal • Represent quality-relevant information for automated… – Data Quality Monitoring – Data Quality Assessment – Data Cleansing – Filtering of High Quality Data …in a standardized vocabulary.
  12. Motives for DQM-Vocabulary • Support people in explicitly expressing data quality requirements in the "same language" on Web scale • Support the creation of consensual agreements on quality requirements • Reduce effort for DQM activities • Raise transparency about assumed quality requirements • Enable consistency checks among quality requirements
  13. Part 4: Our Approach
  14. Basic Architecture (diagram). Components: Data Acquisition from RDB A and RDB B; Knowledgebase; DQM-Vocabulary; SPARQL-Query-Engine. Outputs: Assessment Scores; HQ Data Retrieval; Problem Classification; Cleansed Data.
  15. Main Concepts of DQM-Vocabulary: Classify Quality Problems; Express Requirements; Annotate Quality Scores; Express Cleansing Tasks; Account for Task-Dependent Requirements.
  16. Data Quality Problem Types: Source for Potential Requirements. Inconsistent duplicates; Invalid characters; Missing classification; Incorrect reference; Character alignment violation; Approximate duplicates; Word transpositions; Invalid substrings; Mistyping / misspelling errors; Cardinality violation; Missing values; Referential integrity violation; Misfielded values; Unique value violation; False values; Functional dependency violation; Out of range values; Imprecise values; Existence of homonyms; Meaningless values; Incorrect classification; Existence of synonyms; Contradictory relationships; Outdated conceptual elements; Outdated values.
  17. Data Quality Requirements: Syntactical Rules; Semantic Rules; Redundancy Rules; Completeness Rules; Timeliness Rules.
  18. Quality-Influencing Artifacts (diagram). Current focus of DQM-Vocabulary: Data.
  19. Design Alternatives: Statements about Classes & Properties. (1) Using classes and properties as subjects. (2) Using datatype properties with xsd:anyURI. (3) Mapping class and property URIs to new URIs.
  20. Part 5: Application Examples
  21. Example 1: Legal Value Rule (1/3). Which instances have illegal values for property foo:country?
  22. Example 1: Legal Value Rule (2/3) (diagram; legend: class, instance, literal value). Class: dqm:LegalValueRule. Instance: foo:LegalValueRule_1. Literal values: "tref:Countries", "foo:Countries", "tref:countryName", "foo:countryName".
  23. Example 1: Legal Value Rule (3/3)
  24. Example 2: DQ-Assessment (1/2). How syntactically accurate are all properties that are subject to LegalValueRules?
  25. Example 2: DQ-Assessment (2/2)
  26. Part 6: Conclusions & Planned Work
  27. Advantages of DQM-Vocabulary • Minimizes human effort for DQM • Web-scale sharing/reuse of data quality requirements • Consistency checks among data quality requirements • Transparency about applied data quality rules
  28. Limitations • Representation of complex functional dependency rules and derivation rules • Limited experience with real-world data sets • Currently no own concepts for classes and properties • Research still in progress
  29. Future Work • Evaluation of design alternatives • Development of processing framework • Representation of more complex functional dependency rules / derivation rules • Extension of DQM-Vocabulary • Evaluation on real-world data sets • Publication at http://semwebquality.org
  30. Christian Fürber, Researcher, E-Business & Web Science Research Group, Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany. Skype: c.fuerber; email: christian@fuerber.com; web: http://www.unibw.de/ebusiness; homepage: http://www.fuerber.com; Twitter: http://www.twitter.com/cfuerber. Paper available at http://bit.ly/gYEDdQ
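
The quality requirements named on slide 8 (population is numeric, population is not negative, URLs start with a scheme such as http:// or https://) could be expressed as machine-checkable rules roughly as follows. This is only a minimal sketch in plain Python, not the DQM-Vocabulary itself: the property names `population` and `homepage`, the rule labels, and the URL pattern are assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed pattern: any scheme followed by "://" and a non-empty remainder.
URL_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://\S+$")


def is_numeric(value: str) -> bool:
    """True if the literal can be parsed as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def is_non_negative(value: str) -> bool:
    """True unless the literal is a number below zero.

    Non-numeric literals are left to the numeric rule, so they pass here.
    """
    return not is_numeric(value) or float(value) >= 0


def looks_like_url(value: str) -> bool:
    """True if the literal starts with a scheme such as http:// or https://."""
    return URL_PATTERN.match(value) is not None


# Hypothetical rule table: property name -> list of (rule label, predicate).
RULES: Dict[str, List[Tuple[str, Callable[[str], bool]]]] = {
    "population": [("numeric", is_numeric), ("non_negative", is_non_negative)],
    "homepage": [("url", looks_like_url)],
}


def violations(record: Dict[str, str]) -> List[str]:
    """Return 'property:rule' labels for every rule the record breaks."""
    found = []
    for prop, checks in RULES.items():
        if prop not in record:
            continue  # missing values would need a separate completeness rule
        for label, check in checks:
            if not check(record[prop]):
                found.append(f"{prop}:{label}")
    return found


if __name__ == "__main__":
    print(violations({"population": "-5", "homepage": "www.example.org"}))
```

Keeping the rules in a table separate from the checking loop mirrors the deck's idea of stating requirements explicitly so that they can be shared and compared, rather than hiding them in application code.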

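The two application examples (slides 21 to 25) ask which instances hold illegal values for a property and how syntactically accurate the property is. The queries shown on the slides are not part of this transcript, so the following is a hedged sketch of the underlying idea in plain Python rather than SPARQL. It assumes that a legal value rule compares each value against a trusted reference list, and that the accuracy score is the share of values found in that list; the instance identifiers and country names are invented sample data.

```python
from typing import Dict, List, Set


def illegal_instances(values: Dict[str, str], legal_values: Set[str]) -> List[str]:
    """Return the identifiers whose value is not in the trusted reference set."""
    return sorted(iid for iid, value in values.items() if value not in legal_values)


def syntactic_accuracy(values: Dict[str, str], legal_values: Set[str]) -> float:
    """Share of values that appear in the trusted reference set (assumed metric)."""
    if not values:
        raise ValueError("no values to assess")
    legal_count = sum(1 for value in values.values() if value in legal_values)
    return legal_count / len(values)


if __name__ == "__main__":
    # Hypothetical trusted reference: legal country names.
    trusted = {"Germany", "Sweden", "France"}
    # Hypothetical tested data: instance identifier -> country value.
    tested = {
        "foo:loc1": "Germany",
        "foo:loc2": "Sweeden",
        "foo:loc3": "Sweden",
        "foo:loc4": "Atlantis",
    }
    print(illegal_instances(tested, trusted))
    print(syntactic_accuracy(tested, trusted))
```

In the architecture from slide 14, both questions would instead be answered by a SPARQL query engine reading the rule from the knowledgebase; the set lookup here only stands in for that join between tested and trusted values.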