LDQ 2014 DQ Methodology


Slides for "Methodology for Assessment of Linked Data Quality: A Framework", presented at the Workshop on Linked Data Quality (LDQ 2014).

Published in: Data & Analytics
  1. 1. A Methodology for Assessment of Linked Data Quality. Anisa Rula, Amrapali Zaveri
  2. 2. Outline ➢Linked Data Quality ○ Current State ○ Limitations ➢Quality Assessment Methodology ○ 3 phases, 6 steps ➢Conclusion ○ Future Work
  3. 3. Linked Data Quality ● ca. 50 billion facts in the Linked Data Cloud ● But what about the quality? ● Data is only as good as its quality!
  4. 4. Linked Data Quality ➢30 approaches, 18 Dimensions, 69 Metrics* ➢12 Tools ○ Automated ○ Semi-automated ➢No generalized methodology ➢Not taking into account the actual use case/user requirements ➢Only assessment, no improvement *
  5. 5. Quality Assessment Methodology for Linked Data ➢3 Phases ➢6 steps
  6. 6. Phase I: Requirement Analysis Step I: Use Case Analysis ➢Description that best illustrates the intended usage of the dataset(s) ➢Two types of users ○ Consumers ○ Potential consumers
  7. 7. Phase II: Quality Assessment Step II: Identification of quality issues ➢Based on the use case ➢Checklist-based approach ➢Each checklist item is answered Yes (1) or No (0) ➢Result: list of quality dimensions
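
The checklist in Step II can be illustrated with a small sketch. The slide only states that answers are Yes (1) or No (0) and that the outcome is a list of quality dimensions; the function name `select_dimensions`, the dictionary layout, and the example answers are assumptions made for illustration, with dimension names taken from other slides in this deck.

```python
def select_dimensions(checklist):
    """Return the quality dimensions whose checklist answer is Yes (1).

    `checklist` maps a dimension name to 1 (Yes) or 0 (No).
    This layout is a hypothetical one, not taken from the slides.
    """
    for dimension, answer in checklist.items():
        if answer not in (0, 1):
            raise ValueError(f"answer for {dimension!r} must be 0 or 1")
    # Dictionaries preserve insertion order, so the result keeps checklist order.
    return [dimension for dimension, answer in checklist.items() if answer == 1]


# Illustrative answers for an imagined use case.
checklist = {
    "accuracy": 1,
    "completeness": 1,
    "interlinking": 0,
    "syntactic validity": 1,
}
relevant_dimensions = select_dimensions(checklist)
print(relevant_dimensions)
```

Only the dimensions marked 1 would then be carried into the later assessment steps.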
  8. 8. Phase II: Quality Assessment Step III: Statistics and Low-level Analysis ➢Generic statistics ➢Example ○ Interlinking degree ○ Blank nodes
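
A minimal sketch of the two generic statistics named in Step III. The slide does not define either measure, so the definitions here are assumptions: blank nodes are counted as distinct terms starting with `_:` (the N-Triples convention), and the interlinking degree is taken to be the share of triples whose object is an IRI outside the dataset's own namespace. The function `low_level_stats`, the tuple representation of triples, and the `example.org` data are all hypothetical.

```python
def low_level_stats(triples, local_prefix):
    """Compute simple statistics over (subject, predicate, object) string triples.

    Assumed definitions (not given in the slides):
    - blank node: any subject or object starting with "_:"
    - interlinking degree: fraction of triples whose object is an
      http(s) IRI that does not start with `local_prefix`
    """
    blank_nodes = set()
    external_links = 0
    for subject, _predicate, obj in triples:
        for term in (subject, obj):
            if term.startswith("_:"):
                blank_nodes.add(term)
        if obj.startswith("http") and not obj.startswith(local_prefix):
            external_links += 1
    total = len(triples)
    return {
        "triples": total,
        "blank_nodes": len(blank_nodes),
        "interlinking_degree": external_links / total if total else 0.0,
    }


# Hypothetical four-triple dataset; literals are written with quotes.
triples = [
    ("http://example.org/a", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/A"),
    ("http://example.org/a", "http://example.org/name", '"A"'),
    ("_:b0", "http://example.org/knows", "http://example.org/a"),
    ("http://example.org/a", "http://example.org/address", "_:b1"),
]
stats = low_level_stats(triples, local_prefix="http://example.org/")
print(stats)
```

In practice an RDF library would parse the data; plain tuples are used here only to keep the sketch self-contained.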
  9. 9. Phase II: Quality Assessment Step IV: Advanced Analysis ➢High-level metrics ➢Example ○ Accuracy ○ Completeness ➢Requires (i) input and (ii) target dataset
  10. 10. Data Quality Score ➢Ratio ○ DQscore = 1 - (V/T), computed for each property ■ V: total no. of instances that violate a DQ rule ■ T: total no. of relevant instances ○ DQweightedscore = (DQscore * wi) / W, for the overall quality across properties ■ wi: weight of property i ■ W: sum of all weighting factors of the properties
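
The two scores on this slide can be worked through in a short sketch. The per-property formula follows the slide directly. For the overall score, the slide shows a single weighted term; summing those terms over all properties is an assumption made here, as are the function names, the property names, and the example counts and weights.

```python
def dq_score(violations, total):
    """Per-property score: DQscore = 1 - (V / T)."""
    if total <= 0:
        raise ValueError("T (relevant instances) must be positive")
    if not 0 <= violations <= total:
        raise ValueError("V must lie between 0 and T")
    return 1 - violations / total


def weighted_dq_score(scores, weights):
    """Overall score, assumed to be the sum over properties of DQscore_i * w_i / W.

    W is the sum of the weights of the scored properties.
    """
    total_weight = sum(weights[prop] for prop in scores)
    if total_weight <= 0:
        raise ValueError("W (sum of weights) must be positive")
    return sum(scores[prop] * weights[prop] for prop in scores) / total_weight


# Hypothetical counts: (V, T) per property.
counts = {"name": (10, 100), "birthDate": (50, 200)}
scores = {prop: dq_score(v, t) for prop, (v, t) in counts.items()}

# Hypothetical weights: "name" matters twice as much as "birthDate".
weights = {"name": 2, "birthDate": 1}
overall = weighted_dq_score(scores, weights)
print(scores, round(overall, 2))
```

With these illustrative numbers the per-property scores are 0.9 and 0.75, and the weighted overall score comes out at roughly 0.85.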
  11. 11. Phase III: Quality Improvement Step V: Root Cause Analysis ➢Analyze cause of each quality issue ➢Helps user interpret the results ➢Detect whether the problem occurs in the original dataset ➢In case original dataset is unavailable, analyze the available dataset to determine the cause
  12. 12. Phase III: Quality Improvement Step VI: Fixing Quality Problems ➢Semi-automatic ○ Consistency ○ Completeness ○ Syntactic validity ➢Crowdsourcing* ○ Semantic accuracy ○ Datatypes ○ Interlinks * Acosta et al., Crowdsourcing Linked Data Quality Assessment. ISWC 2013.
  13. 13. Conclusion and Future Work ➢Conclusion ○ Assessment methodology: 3 phases, 6 steps ○ Focus on use case ○ Improvement phase! ➢Future Work ○ Application to an actual use case ○ Build a tool
  14. 14. Thank you! Questions, suggestions, comments? @AnisaRula @amrapaliz