Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

  1. Towards a Vocabulary for DQM in Semantic Web Architectures (Research in Progress). Christian Fürber and Martin Hepp, christian@fuerber.com, mhepp@computer.org. Presentation at the 1st International Workshop on Linked Web Data Management, March 25th, 2011, Uppsala, Sweden.
  2. Part 1: What's the Problem? (Slide footer: C. Fürber, M. Hepp: Towards a Vocabulary for DQM in SemWeb Architectures)
  3. Various Data Quality Problems: Inconsistent duplicates; Invalid characters; Missing classification; Incorrect reference; Approximate duplicates; Character alignment violation; Word transpositions; Invalid substrings; Mistyping / misspelling errors; Cardinality violation; Missing values; Referential integrity violation; Misfielded values; Unique value violation; False values; Functional dependency violation; Out of range values; Imprecise values; Existence of homonyms; Meaningless values; Incorrect classification; Existence of synonyms; Contradictory relationships; Outdated conceptual elements; Untyped literals; Outdated values. Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  4. The Problem: Negative population; Weird population values; Invalid URLs. Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
  5. Part 2: What are high quality data?
  6. What is Data Quality?
     • Data's "fitness for use by data consumers" (Wang, Strong 1996)
     • "Conformance to specification" (Kahn et al. 2002)
     • "Data are of high quality if they are fit for their intended uses in operations, decision making, and planning. Data are fit for use if they are free of defects and possess desired features." (Redman 2001)
     • Requirements as "benchmark"
  7. Perspective-Neutral Data Quality: Data quality is the degree to which data fulfills quality requirements, no matter who makes the quality requirements.
  8. Quality Requirements and the Problems They Expose
     • Requirement: Population cannot be negative. Problem: Negative population.
     • Requirement: Population is indicated by numeric values. Problem: Weird population values.
     • Requirement: URLs usually start with http://, https://, etc. Problem: Invalid URLs.
     Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
  9. Satisfying Quality Requirements
     • Problem 1: Expressing quality requirements (desired states held by individuals, groups, standards, etc.)
     • Problem 2: Harmonizing requirements
     • Problem 3: Satisfying requirements (status quo = desired state)
  10. Part 3: Research Goal
  11. Major Research Goal: Represent quality-relevant information in a standardized vocabulary for automated
     • Data quality monitoring
     • Data quality assessment
     • Data cleansing
     • Filtering of high quality data
  12. Motives for DQM-Vocabulary
     • Help people explicitly express data quality requirements in the "same language" at Web scale
     • Support the creation of consensual agreements on quality requirements
     • Reduce effort for DQM activities
     • Raise transparency about assumed quality requirements
     • Enable consistency checks among quality requirements
  13. Part 4: Our Approach
  14. Basic Architecture (diagram, top to bottom)
     • Outputs: Problem classification; Assessment scores; HQ data retrieval; Cleansed data
     • SPARQL-Query-Engine
     • DQM-Vocabulary, Knowledgebase
     • Data acquisition from RDB A and RDB B
  15. Main Concepts of DQM-Vocabulary
     • Classify quality problems
     • Express requirements
     • Annotate quality scores
     • Express cleansing tasks
     • Account for task-dependent requirements
  16. Data Quality Problem Types: Source for Potential Requirements. Inconsistent duplicates; Invalid characters; Missing classification; Incorrect reference; Character alignment violation; Approximate duplicates; Word transpositions; Invalid substrings; Mistyping / misspelling errors; Cardinality violation; Missing values; Referential integrity violation; Misfielded values; Unique value violation; False values; Functional dependency violation; Out of range values; Imprecise values; Existence of homonyms; Meaningless values; Incorrect classification; Existence of synonyms; Contradictory relationships; Outdated conceptual elements; Outdated values.
  17. Data Quality Requirements
     • Syntactical rules
     • Semantic rules
     • Redundancy rules
     • Completeness rules
     • Timeliness rules
  18. Quality-Influencing Artifacts. Current focus of DQM-Vocabulary: Data
  19. Design Alternatives: Statements about Classes & Properties
     (1) Using classes and properties as subjects
     (2) Using datatype properties with xsd:anyURI
     (3) Mapping class and property URIs to new URIs
  20. Part 5: Application Examples
  21. Example 1: Legal Value Rule (1/3). Which instances have illegal values for property foo:country?
  22. Example 1: Legal Value Rule (2/3). Diagram (legend: class, instance, literal value): class dqm:LegalValueRule; instance foo:LegalValueRule_1; literal values "tref:Countries", "foo:Countries", "tref:countryName", "foo:countryName"
  23. Example 1: Legal Value Rule (3/3)
  24. Example 2: DQ-Assessment (1/2). How syntactically accurate are all properties that are subject to LegalValueRules?
  25. Example 2: DQ-Assessment (2/2)
  26. Part 6: Conclusions & Planned Work
  27. Advantages of DQM-Vocabulary
     • Minimizes human effort for DQM
     • Web-scale sharing/reuse of data quality requirements
     • Consistency checks among data quality requirements
     • Transparency about applied data quality rules
  28. Limitations
     • Representation of complex functional dependency rules and derivation rules
     • Limited experience with real-world data sets
     • Currently no concepts of its own for classes and properties
     • Research still in progress
  29. Future Work
     • Evaluation of design alternatives
     • Development of processing framework
     • Representation of more complex functional dependency rules / derivation rules
     • Extension of DQM-Vocabulary
     • Evaluation on real-world data sets
     • Publication at http://semwebquality.org
  30. Christian Fürber, Researcher, E-Business & Web Science Research Group, Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany. Skype: c.fuerber. Email: christian@fuerber.com. Web: http://www.unibw.de/ebusiness. Homepage: http://www.fuerber.com. Twitter: http://www.twitter.com/cfuerber. Paper available at http://bit.ly/gYEDdQ
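
The legal value rule of Example 1 (slides 21 to 23) can be illustrated with a small sketch. The transcript does not preserve the property names that link foo:LegalValueRule_1 to its four literal values, nor the query shown on slide 23, so everything below is an assumption: the field names of the rule object are hypothetical, the "tref:" data is assumed to be the trusted reference and the "foo:" data the tested data, the sample triples are made up, and plain Python sets stand in for the SPARQL engine of the presented architecture.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class LegalValueRule:
    # All four field names are hypothetical; the slide only shows the values.
    tested_class: str
    tested_property: str
    reference_class: str
    reference_property: str


def property_values(triples: Iterable[Triple], cls: str, prop: str) -> Set[Tuple[str, str]]:
    """Return (instance, value) pairs of `prop` for all instances typed as `cls`."""
    triples = list(triples)
    instances = {s for (s, p, o) in triples if p == "rdf:type" and o == cls}
    return {(s, o) for (s, p, o) in triples if p == prop and s in instances}


def illegal_values(triples: Iterable[Triple], rule: LegalValueRule) -> List[Tuple[str, str]]:
    """List (instance, value) pairs whose value is absent from the reference data."""
    triples = list(triples)
    legal = {v for (_, v) in property_values(triples, rule.reference_class, rule.reference_property)}
    tested = property_values(triples, rule.tested_class, rule.tested_property)
    return sorted(pair for pair in tested if pair[1] not in legal)


# Assumed reading of the slide: foo:* is tested against tref:*.
RULE = LegalValueRule("foo:Countries", "foo:countryName", "tref:Countries", "tref:countryName")

# Made-up sample data, not taken from the presentation.
TRIPLES: List[Triple] = [
    ("tref:de", "rdf:type", "tref:Countries"),
    ("tref:de", "tref:countryName", "Germany"),
    ("tref:se", "rdf:type", "tref:Countries"),
    ("tref:se", "tref:countryName", "Sweden"),
    ("foo:c1", "rdf:type", "foo:Countries"),
    ("foo:c1", "foo:countryName", "Germany"),
    ("foo:c2", "rdf:type", "foo:Countries"),
    ("foo:c2", "foo:countryName", "Sweden"),
    ("foo:c3", "rdf:type", "foo:Countries"),
    ("foo:c3", "foo:countryName", "Germny"),
]

print(illegal_values(TRIPLES, RULE))
```

With this sample data the only reported pair is foo:c3 with the misspelled value "Germny". Keeping the rule as data, separate from the checking function, mirrors the idea of the slides that requirements are stated once and then evaluated by generic queries.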
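
Example 2 (slides 24 and 25) asks how syntactically accurate the properties covered by legal value rules are. The slide with the actual query is not part of the transcript, so the scoring formula below is an assumption rather than the authors' definition: the score of a property is taken to be the share of its values that appear in the set of legal values. The function name, the sample values, and the dictionary layout are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set


def syntactic_accuracy(values: Iterable[str], legal_values: Set[str]) -> Optional[float]:
    """Share of values that are legal; None when there is nothing to assess."""
    values = list(values)
    if not values:
        return None
    legal_count = sum(1 for value in values if value in legal_values)
    return legal_count / len(values)


# Made-up observations per property, not taken from the presentation.
observed: Dict[str, List[str]] = {
    "foo:countryName": ["Germany", "Sweden", "Germny", "Sweden"],
}
# Legal values per property, e.g. as collected from trusted reference data.
legal: Dict[str, Set[str]] = {
    "foo:countryName": {"Germany", "Sweden"},
}

scores = {prop: syntactic_accuracy(vals, legal[prop]) for prop, vals in observed.items()}
print(scores)
```

Three of the four sample values are legal, so the sketch assigns foo:countryName a score of 0.75. Returning None for a property without values avoids reporting a misleading score of 0 or 1 when no data was assessed.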