Towards a Vocabulary for Data Quality Management in Semantic Web Architectures
Towards a Vocabulary for Data Quality Management in Semantic Web Architectures Presentation Transcript

  • 1. Towards a Vocabulary for DQM in Semantic Web Architectures (Research in Progress)
    Christian Fürber and Martin Hepp, christian@fuerber.com, mhepp@computer.org
    Presentation @ 1st International Workshop on Linked Web Data Management, March 25th, 2011, Uppsala, Sweden
  • 2. Part 1: What's the Problem?
    C. Fürber, M. Hepp: Towards a Vocabulary for DQM in SemWeb Architectures
  • 3. Various Data Quality Problems
    Inconsistent duplicates; Invalid characters; Missing classification; Incorrect reference; Approximate duplicates; Character alignment violation; Word transpositions; Invalid substrings; Mistyping / Misspelling errors; Cardinality violation; Missing values; Referential integrity violation; Misfielded values; Unique value violation; False values; Functional Dependency Violation; Out of range values; Imprecise values; Existence of Homonyms; Meaningless values; Incorrect classification; Existence of Synonyms; Contradictory relationships; Outdated conceptual elements; Untyped literals; Outdated values
    Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • 4. The Problem
    • Negative Population
    • Weird Population Values
    • Invalid URLs
    Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
  • 5. Part 2: What are high quality data?
  • 6. What is Data Quality?
    • Data's "fitness for use by data consumers" (Wang, Strong 1996)
    • "Conformance to specification" (Kahn et al. 2002)
    • "Data are of high quality if they are fit for their intended uses in operations, decision making, and planning. Data are fit for use if they are free of defects and possess desired features." (Redman 2001)
    • Requirements as "Benchmark"
  • 7. Perspective-Neutral Data Quality
    Data quality is the degree to which data fulfills quality requirements, no matter who makes the quality requirements.
  • 8. The Problem and the corresponding Quality Requirements
    • Negative Population: Population cannot be negative
    • Weird Population Values: Population is indicated by numeric values
    • Invalid URLs: URLs usually start with http://, https://, etc.
    Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
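The three requirements on slide 8 can be read as simple checks over records. The following is a minimal sketch under stated assumptions, not the authors' implementation: the sample records, field names, and function names are hypothetical, "numeric" is taken to mean anything Python's `float()` can parse, and only the `http` and `https` schemes are accepted although the slide also says "etc.".

```python
from urllib.parse import urlparse

# Hypothetical sample records; values are illustrative and were not
# retrieved from the endpoint named on the slide.
records = [
    {"name": "A", "population": "1200", "url": "http://example.org/a"},
    {"name": "B", "population": "-50", "url": "https://example.org/b"},
    {"name": "C", "population": "approx. 3 million", "url": "example.org/c"},
]


def is_numeric(value: str) -> bool:
    """Return True if float() can parse the value (assumed meaning of 'numeric')."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def has_web_scheme(value: str) -> bool:
    """Return True if the value starts with an http or https scheme."""
    return urlparse(value).scheme in ("http", "https")


def find_violations(rows):
    """Collect (record name, violated requirement) pairs in input order."""
    violations = []
    for row in rows:
        if not is_numeric(row["population"]):
            violations.append((row["name"], "population is not numeric"))
        elif float(row["population"]) < 0:
            violations.append((row["name"], "population is negative"))
        if not has_web_scheme(row["url"]):
            violations.append((row["name"], "URL lacks http(s) scheme"))
    return violations


for name, problem in find_violations(records):
    print(name, problem)
```

Hard-coding the checks like this is exactly what the deck argues against: each consumer re-implements the same requirements. The sketch only shows what the requirements mean operationally.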
  • 9. Satisfying Quality Requirements
    • Problem 1: Expressing Quality Requirements
    • Problem 2: Harmonizing Requirements
    • Problem 3: Satisfying Requirements
    Diagram labels: Desired State (Individuals); Desired State (Groups); Desired State (Standards, etc.); Status Quo = Desired State
  • 10. Part 3: Research Goal
  • 11. Major Research Goal
    • Represent quality-relevant information for automated…
      – Data Quality Monitoring
      – Data Quality Assessment
      – Data Cleansing
      – Filtering of High Quality Data
    …in a standardized vocabulary.
  • 12. Motives for DQM-Vocabulary
    • Help people explicitly express data quality requirements in the "same language" at Web scale
    • Support the creation of consensual agreements on quality requirements
    • Reduce effort for DQM activities
    • Raise transparency about assumed quality requirements
    • Enable consistency checks among quality requirements
  • 13. Part 4: Our Approach
  • 14. Basic Architecture
    Diagram labels, top to bottom:
    • Assessment Scores; HQ Data Retrieval; Problem Classification; Cleansed Data
    • SPARQL-Query-Engine
    • DQM-Vocabulary; Knowledgebase
    • Data Acquisition
    • RDB A; RDB B
  • 15. Main Concepts of DQM-Vocabulary
    • Classify Quality Problems
    • Express Requirements
    • Annotate Quality Scores
    • Express Cleansing Tasks
    • Account for Task-Dependent Requirements
  • 16. Data Quality Problem Types: Source for Potential Requirements
    Inconsistent duplicates; Invalid characters; Missing classification; Incorrect reference; Character alignment violation; Approximate duplicates; Word transpositions; Invalid substrings; Mistyping / Misspelling errors; Cardinality violation; Missing values; Referential integrity violation; Misfielded values; Unique value violation; False values; Functional Dependency Violation; Out of range values; Imprecise values; Existence of Homonyms; Meaningless values; Incorrect classification; Existence of Synonyms; Contradictory relationships; Outdated conceptual elements; Outdated values
  • 17. Data Quality Requirements
    • Syntactical Rules
    • Semantic Rules
    • Redundancy Rules
    • Completeness Rules
    • Timeliness Rules
  • 18. Quality-Influencing Artifacts
    Current Focus of DQM-Vocabulary: Data
  • 19. Design Alternatives: Statements about Classes & Properties
    (1) Using classes and properties as subjects
    (2) Using datatype properties with xsd:anyURI
    (3) Mapping class and property URIs to new URIs
  • 20. Part 5: Application Examples
  • 21. Example 1: Legal Value Rule (1/3)
    Which instances have illegal values for the property foo:country?
  • 22. Example 1: Legal Value Rule (2/3)
    Diagram legend: Class; Instance; Literal value
    • Class: dqm:LegalValueRule
    • Instance: foo:LegalValueRule_1
    • Literal values: "tref:Countries", "foo:Countries", "tref:countryName", "foo:countryName"
  • 23. Example 1: Legal Value Rule (3/3)
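The query shown on slide 23 did not survive in this transcript. As a rough illustration of what a legal value rule checks, here is a minimal Python sketch that assumes the four literals on slide 22 name a tested class and property (`foo:Countries`, `foo:countryName`) and a trusted reference class and property (`tref:Countries`, `tref:countryName`). The dictionary keys, the sample triples, and the function names are hypothetical and are not confirmed DQM-Vocabulary terms; the authors' approach evaluates such rules with SPARQL rather than Python.

```python
RDF_TYPE = "rdf:type"

# Hypothetical rule description. The four prefixed names appear on slide 22;
# the keys are placeholder names, not DQM-Vocabulary properties.
rule = {
    "tested_class": "foo:Countries",
    "tested_property": "foo:countryName",
    "reference_class": "tref:Countries",
    "reference_property": "tref:countryName",
}

# Hypothetical data as (subject, predicate, object) triples.
triples = [
    ("tref:c1", RDF_TYPE, "tref:Countries"),
    ("tref:c1", "tref:countryName", "Germany"),
    ("tref:c2", RDF_TYPE, "tref:Countries"),
    ("tref:c2", "tref:countryName", "Sweden"),
    ("foo:x1", RDF_TYPE, "foo:Countries"),
    ("foo:x1", "foo:countryName", "Germany"),
    ("foo:x2", RDF_TYPE, "foo:Countries"),
    ("foo:x2", "foo:countryName", "Germny"),
]


def values_of(data, cls, prop):
    """Return (instance, value) pairs of prop for all instances of cls."""
    instances = {s for s, p, o in data if p == RDF_TYPE and o == cls}
    return [(s, o) for s, p, o in data if p == prop and s in instances]


def illegal_values(data, legal_value_rule):
    """Return tested (instance, value) pairs whose value is absent from the reference data."""
    legal = {
        value
        for _, value in values_of(
            data,
            legal_value_rule["reference_class"],
            legal_value_rule["reference_property"],
        )
    }
    tested = values_of(
        data,
        legal_value_rule["tested_class"],
        legal_value_rule["tested_property"],
    )
    return [(s, v) for s, v in tested if v not in legal]


print(illegal_values(triples, rule))
```

Because the rule is plain data, the same checking function works for any other tested property once a new rule is described, which is the reuse argument the deck makes for a shared vocabulary.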
  • 24. Example 2: DQ-Assessment (1/2)
    How syntactically accurate are all properties that are subject to LegalValueRules?
  • 25. Example 2: DQ-Assessment (2/2)
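The assessment query on slide 25 is likewise missing from the transcript, and the slides do not state how the score is computed. One plausible reading, shown here purely as an assumption, is the share of a property's values that appear among the legal values. The function name and the sample data are hypothetical.

```python
def syntactic_accuracy(values, legal_values):
    """Share of values found among the legal values; None if there is nothing to assess."""
    if not values:
        return None
    legal = set(legal_values)
    return sum(1 for value in values if value in legal) / len(values)


# Hypothetical values for one property that is subject to a legal value rule.
tested = {"foo:countryName": ["Germany", "Germny", "Sweden", "Sweden"]}
legal = {"foo:countryName": ["Germany", "Sweden"]}

# One score per tested property.
scores = {
    prop: syntactic_accuracy(values, legal[prop])
    for prop, values in tested.items()
}
print(scores)
```

Returning `None` for a property without values keeps "not assessable" distinct from a score of zero; whether the vocabulary makes that distinction is not stated in the slides.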
  • 26. Part 6: Conclusions & Planned Work
  • 27. Advantages of DQM-Vocabulary
    • Minimizes human effort for DQM
    • Web-scale sharing/reuse of data quality requirements
    • Consistency checks among data quality requirements
    • Transparency about applied data quality rules
  • 28. Limitations
    • Representation of complex functional dependency rules and derivation rules
    • Limited experience with real-world data sets
    • Currently no dedicated concepts for classes and properties
    • Research still in progress
  • 29. Future Work
    • Evaluation of design alternatives
    • Development of processing framework
    • Representation of more complex functional dependency rules / derivation rules
    • Extension of DQM-Vocabulary
    • Evaluation on real-world data sets
    • Publication at http://semwebquality.org
  • 30. Christian Fürber, Researcher
    E-Business & Web Science Research Group
    Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany
    skype: c.fuerber
    email: christian@fuerber.com
    web: http://www.unibw.de/ebusiness
    homepage: http://www.fuerber.com
    twitter: http://www.twitter.com/cfuerber
    Paper available at http://bit.ly/gYEDdQ