Sieve - Data Quality and Fusion - LWDM2012

Presentation at the LWDM workshop at EDBT 2012.

The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonical judgement that is suitable for each and every task is not feasible.

To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion.

We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.

Sieve - Data Quality and Fusion - LWDM2012

1. Sieve: Linked Data Quality Assessment and Fusion
   Pablo N. Mendes, Hannes Mühleisen, Christian Bizer
   With contributions from: Andreas Schultz, Andrea Matteini, Christian Becker, Robert Isele
2. “sieve”: “A sieve, or sifter, separates wanted elements from unwanted material using a woven screen such as a mesh or net.” Source: http://en.wikipedia.org/wiki/Sieve
3. What is Linked Data?
   • Raw data (RDF)
   • Accessible on the Web
   • Data can link to other data sources
   [Diagram: things in data sources A to E connected by data links]
   • Benefits: ease of access and re-use; enables discovery
4. Linking Open Data Cloud: http://lod-cloud.net
5. Linked Data Challenges
   • Data providers have different intentions, experience/knowledge
     • data may be inaccurate, outdated, spam etc.
   • Data sources that overlap in content may use…
     • … different RDF schemata
     • … different identifiers for the same real-world entity
     • … conflicting values for properties
   • Integrating public datasets with internal databases poses the same problems
6. An Architecture for Linked Data Applications
7. LDIF – Linked Data Integration Framework
   1. Collect data: managed download and update
   2. Translate data into a single target vocabulary
   3. Resolve identifier aliases into local target URIs
   4. Assess quality, filter bad results, resolve conflicts
   5. Output
   • Open source (Apache License, Version 2.0)
   • Collaboration between Freie Universität Berlin and mes|semantics
8. LDIF Pipeline, step 1 (Collect data). Supported data sources:
   • RDF dumps (various formats)
   • SPARQL endpoints
   • Crawling Linked Data
9. LDIF Pipeline, step 2 (Translate data, via R2R). Data sources use a wide range of different RDF vocabularies; for example, dbpedia-owl:City, schema:Place and fb:location.citytown are all mapped to local:City.
   • Mappings expressed in RDF (Turtle)
   • Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
   • Complex mappings with SPARQL expressivity
   • Transformation functions
10. LDIF Pipeline, step 3 (Resolve identities, via Silk). Data sources use different identifiers for the same entity; for example, “Berlin” could denote Berlin, Germany or Berlin, CT / MD / NJ / MA, and Silk resolves it to Berlin, Germany.
   • Profiles expressed in XML
   • Supports various comparators and transformations
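Silk is configured through declarative XML link specifications, as noted on slide 10; the Scala fragment below is only a rough, hypothetical sketch of the underlying idea (compare labels and treat sufficiently similar ones as aliases of the same entity) and does not reflect Silk's actual profiles or API.

```scala
// Illustration only: Silk itself is configured with declarative XML link
// specifications; this hypothetical sketch just shows the core idea of
// comparing labels and resolving aliases when they are similar enough.
object IdentityResolutionSketch {

  // Standard dynamic-programming edit distance between two strings.
  def levenshtein(a: String, b: String): Int = {
    val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length)
      d(i)(j) = Seq(
        d(i - 1)(j) + 1,                                       // deletion
        d(i)(j - 1) + 1,                                       // insertion
        d(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1) // substitution
      ).min
    d(a.length)(b.length)
  }

  // Treat two labels as aliases of the same entity if their normalized
  // edit distance stays below the given threshold.
  def sameEntity(label1: String, label2: String, threshold: Double = 0.3): Boolean = {
    val dist = levenshtein(label1.toLowerCase, label2.toLowerCase).toDouble
    dist / math.max(label1.length, label2.length).max(1) <= threshold
  }
}
```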
11. LDIF Pipeline, step 4 (Filter and fuse, via Sieve). Sources provide different values for the same property; for example, two sources state the Total Area as 891.85 km2 and 891.82 km2, and Sieve uses quality scores to fuse them into a single value, 891.85 km2.
   • Profiles expressed in XML
   • Supports various scoring and fusion functions
12. LDIF Pipeline, step 5 (Output).
   • Output options: N-Quads, N-Triples, SPARQL Update Stream
   • Provenance tracking using Named Graphs
13. An Architecture for Linked Data Applications: the Data Quality and Fusion Module
14. Data Fusion: “fusing multiple records representing the same real-world object into a single, consistent, and clean representation” (Bleiholder & Naumann, 2008)
15. Conflict resolution strategies
   • Independent of quality assessment metrics:
     • Pick most frequent (democratic voting)
     • Average, max, min, concatenation
     • Within interval
   • Based on task-specific quality assessment:
     • Keep highest scored
     • Keep all that pass a threshold
     • Trust some sources over others
     • Weighted voting
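A minimal Scala sketch of the two strategy families listed on slide 15; the types and function names are invented for illustration and are not Sieve's actual FusionFunction classes.

```scala
// Hypothetical sketch of the two strategy families on slide 15;
// not Sieve's actual FusionFunction implementations.
case class ConflictingValue(value: String, sourceScore: Double)

object ResolutionStrategies {

  // Independent of quality assessment: pick the most frequent value
  // (democratic voting).
  def mostFrequent(values: Seq[ConflictingValue]): String =
    values.groupBy(_.value).maxBy(_._2.size)._1

  // Quality-based: keep the value from the highest-scored source.
  def keepHighestScored(values: Seq[ConflictingValue]): String =
    values.maxBy(_.sourceScore).value

  // Quality-based: keep all values whose source score passes a threshold.
  def passThreshold(values: Seq[ConflictingValue], threshold: Double): Seq[String] =
    values.filter(_.sourceScore >= threshold).map(_.value)
}

object ResolutionDemo extends App {
  val totalArea = Seq(
    ConflictingValue("891.85 km2", 0.9), // e.g. a recently updated source
    ConflictingValue("891.82 km2", 0.6)  // e.g. an older source
  )
  println(ResolutionStrategies.keepHighestScored(totalArea)) // 891.85 km2
}
```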
16. Data Fusion
   • Input: (potentially) conflicting data, plus quality metadata describing the input
   • Execution: use existing or custom FusionFunctions
   • Output: clean data, according to the user’s definition of clean
17. Configuration: Data Fusion
18. Sieve: Quality Assessment
   • Quality as “fitness for use”:
     • Subjective: good for me might not be enough for you
     • Task dependent: temperature for planning a weekend vs. for a biology experiment
     • Multidimensional: even correct data may be outdated or not available
   • Requires task-specific quality assessment.
19. Data Quality - Conceptual Framework. Dimensions: Accuracy, Consistency, Objectivity, Timeliness, Validity, Believability, Completeness, Understandability, Relevancy, Reputation, Verifiability, Amount of Data, Interpretability, Representational Conciseness, Representational Consistency, Availability, Response Time, Security
20. Configuration: Quality Assessment
   • Quality assessment metrics are composed of:
     • a ScoringFunction (generically applicable to given data types)
     • quality indicators as input (adaptable to the use case)
   • Output: a score in [0;1] that describes the input within a quality dimension, according to the user’s definition of quality
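As an illustration of a scoring function that maps a quality indicator into [0;1], the sketch below scores a graph by the recency of its last update. It is an assumption-laden example, not one of Sieve's shipped scoring functions.

```scala
// Illustrative only: scores a "last updated" quality indicator on a [0,1]
// scale, decaying linearly from 1.0 (updated now) to 0.0 (older than maxAge).
import java.time.{Duration, Instant}

object RecencyScoreSketch {
  def score(lastUpdated: Instant, maxAge: Duration, now: Instant = Instant.now()): Double = {
    val ageSeconds = Duration.between(lastUpdated, now).getSeconds.toDouble
    val rangeSeconds = maxAge.getSeconds.toDouble
    math.max(0.0, math.min(1.0, 1.0 - ageSeconds / rangeSeconds))
  }
}
```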
21. Configuration: Quality Assessment
22. More about Sieve
   • Software: open source, Apache V2
   • Scoring Functions and Fusion Functions can be extended
     • Scala/Java interface, methods score/fuse and fromXML
   • Quality scores can be stored and shared with other applications
   • Website: http://sieve.wbsg.de
     • Documentation, examples, downloads, support
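Slide 22 mentions that custom scoring and fusion functions plug in through a Scala/Java interface with score/fuse and fromXML methods. The sketch below only mimics that shape with simplified, hypothetical signatures to show the extension pattern; Sieve's real trait definitions differ.

```scala
// Simplified, hypothetical shape of the extension pattern described on
// slide 22; Sieve's real Scala traits and signatures differ.
import scala.xml.Node

trait ScoringFunctionSketch {
  // Maps the quality-indicator values found in one named graph to [0,1].
  def score(indicatorValues: Seq[String]): Double
}

// Example: score a graph by how many distinct editors contributed to it.
class EditorCount(normalizeBy: Double) extends ScoringFunctionSketch {
  def score(indicatorValues: Seq[String]): Double =
    math.min(1.0, indicatorValues.distinct.size / normalizeBy)
}

object EditorCount {
  // Builds the function from an XML configuration element of this sketch,
  // e.g. <ScoringFunction class="EditorCount" normalizeBy="20"/>.
  def fromXML(node: Node): EditorCount =
    new EditorCount((node \ "@normalizeBy").text.toDouble)
}
```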
23. Use Case
   [Diagram: multiple complementary, heterogeneous data sources with conflicting values, multidimensional and task-dependent quality indicators, and user-configured conflict resolution strategies combine into the integrated result. Voilà!]
24. Evaluating Quality of Data Integration
   • Completeness
     • How many cities did we find?
     • How many of the properties did we fill with values?
   • Conciseness
     • How much redundancy is there in the object identifiers?
     • How much redundancy is there in the property values?
   • Consistency
     • How many conflicting values are there?
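The questions on slide 24 can be operationalized in several ways; the following sketch shows one simple reading (ratios over fused property values) and is not necessarily the exact formulation used in the paper.

```scala
// One simple way to turn the three questions into numbers; the paper's
// exact definitions of completeness, conciseness and consistency may differ.
object IntegrationMetricsSketch {

  // One fused record per city: property name -> values found for it.
  type Record = Map[String, Seq[String]]

  // Completeness: fraction of (record, property) slots filled with a value.
  def completeness(records: Seq[Record], properties: Seq[String]): Double = {
    if (records.isEmpty || properties.isEmpty) 1.0
    else {
      val filled = for (r <- records; p <- properties if r.getOrElse(p, Nil).nonEmpty) yield 1
      filled.size.toDouble / (records.size * properties.size)
    }
  }

  // Conciseness: fraction of stored values that are not redundant duplicates
  // within their property slot.
  def conciseness(records: Seq[Record]): Double = {
    val slots = records.flatMap(_.values).filter(_.nonEmpty)
    if (slots.isEmpty) 1.0
    else slots.map(_.distinct.size).sum.toDouble / slots.map(_.size).sum
  }

  // Consistency: fraction of non-empty property slots with no conflicting values.
  def consistency(records: Seq[Record]): Double = {
    val slots = records.flatMap(_.values).filter(_.nonEmpty)
    if (slots.isEmpty) 1.0
    else slots.count(_.distinct.size == 1).toDouble / slots.size
  }
}
```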
25. Results: generated data that is more complete, concise and consistent than in the original sources
26. Linked Data Application Architecture: my view on this data space can also be shared and reused. We can “pay as we go”.
27. THANK YOU!
   • Twitter: @pablomendes
   • E-mail: pablo.mendes@fu-berlin.de
   • Website: http://sieve.wbsg.de
   • Google Group: http://bit.ly/ldifgroup
   Supported in part by: Vulcan Inc. as part of its Project Halo; EU FP7 projects LOD2 (Creating Knowledge out of Interlinked Data) and PlanetData (A European Network of Excellence on Large-Scale Data Management)
