Using SPARQL and SPIN for Data Quality Management on the Semantic Web


  1. Using SPARQL and SPIN for Data Quality Management on the Semantic Web. Christian Fürber / Martin Hepp (christian@fuerber.com, mhepp@computer.org). Presentation @ BIS, May 4th 2010.
  2. Vision of the Semantic Web: Publishing data on the web in a meaningful way for more automation, better integration, and higher reusability of data. © Hanspeter Graf / www.pixelio.de. Slide footer: C. Fürber, M. Hepp: Using SPARQL and SPIN for Data Quality Management on the Semantic Web.
  3. Growth of Data: Well on Track… Retrieving information. Building smart SemWeb apps. Reference: http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html
  4. …but what if the published data was of poor quality? Get a giant camcorder from Amazon!
  5. Using Poor Data is Costly: Without quality checks your SemWeb apps will take this data seriously and… get an oversized shipping package with expensive postage, …and waste transportation capacity.
  6. Is there any way to avoid data quality disasters? Yes, if we know about data quality problems before anything bad happens! A giant camcorder on the road!
  7. The Impact of Poor Data Quality: Higher Costs; Missed Revenues; Poor Decisions; Lower Product / Service Quality; Failed Business Processes; Failed Projects; Lower Stakeholder Satisfaction; Fatal Disasters.
  8. Data Quality is a Key Bottleneck of the Semantic Web. The example record below is annotated on the slide with: unique value violation, missing literal values, functional dependency violation, syntax violation.
       <vocab:location rdf:about="http://www.stockdbdemo2.com/stockdblocation/1">
         <vocab:location_ZIP></vocab:location_ZIP>
         <vocab:location_STREETNO></vocab:location_STREETNO>
         <vocab:location_COUNTRY>France</vocab:location_COUNTRY>
         <vocab:location_ID rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</vocab:location_ID>
         <vocab:location_STREET>8489 Strong St.</vocab:location_STREET>
         <vocab:location_STATE>NV</vocab:location_STATE>
         <rdfs:label>location #1</rdfs:label>
         <vocab:location_CITY>Las Vegas</vocab:location_CITY>
       </vocab:location>
  9. Our Approach (shown over the same example record as slide 8): Identification of data quality problems on instance level of Semantic Web sources solely with Semantic Web technologies. Integration advantages. Access to SemWeb data may be useful for data quality management (DQM).
  10. Proposed Architecture (OBDQM). Query Layer: SPARQL + SPIN. Ontology Layer: Domain Ontology, SPIN. Data Sources Layer: RDB, Knowledge Base, Linked Data Cloud.
  11. Defining Data Quality Rules with SPARQL (1): Define what is allowed and negate it. Define what is not allowed. Negations and regular expressions save manual effort.
  12. Defining Data Quality Rules with SPARQL (2): The city "Las Vegas" must be in the country "USA".
        # Checking functional dependency of {?arg4} with {?arg2}
        CONSTRUCT {
          _:b0 a spin:ConstraintViolation .
          _:b0 spin:violationRoot ?this .
          _:b0 spin:violationPath vocab:location_COUNTRY .
        }
        WHERE {
          ?this vocab:location_CITY "Las Vegas" .
          FILTER (!spl:hasValue(?this, vocab:location_COUNTRY, "USA")) .
        }
  13. Defining Data Quality Rules with SPARQL (3): High reusability of data quality rules through SPIN's SPARQL query templates.
        # Checking functional dependency of {?arg4} with {?arg2}
        CONSTRUCT {
          _:b0 a spin:ConstraintViolation .
          _:b0 spin:violationRoot ?this .
          _:b0 spin:violationPath ?arg3 .
        }
        WHERE {
          ?this ?arg1 ?arg2 .
          FILTER (!spl:hasValue(?this, ?arg3, ?arg4)) .
        }
  14. Enforced DQ-Rules with SPIN. Application: http://www.topquadrant.com/products/TB_Composer.html#free
  15. More Data Quality Rule Templates (1). Data Quality Problem: SPARQL Query Template
        Missing literal values: ASK WHERE { ?this ?arg1 "" . }
        Out of range value (lower limit): ASK WHERE { ?this ?arg1 ?value . FILTER (?value < ?arg2) . }
        Out of range value (upper limit): ASK WHERE { ?this ?arg1 ?value . FILTER (?value > ?arg2) . }
  16. More Data Quality Rule Templates (2). Data Quality Problem: SPARQL Query Template
        Syntax violation (only letters and dots allowed): ASK WHERE { ?this ?arg1 ?value . FILTER (!regex(str(?value), "^([A-Za-z,. ])*$")) }
        Unique value violation:
          CONSTRUCT {
            _:b0 a spin:ConstraintViolation .
            _:b0 spin:violationRoot ?a .
            _:b0 spin:violationPath ?arg1 .
          }
          WHERE {
            ?a ?arg1 ?uniqueValue .
            ?b ?arg1 ?uniqueValue .
            FILTER (?a != ?b)
          }
  17. Contributions
        • Domain-independent SPARQL query templates for data quality problem identification
        • Queries are highly reusable
        • Architecture enables the use of Linked Data
        • Methodology for data quality management of Semantic Web data
        • First approach on how to apply SPIN for DQM
  18. Limitations & Open Issues
        • Knowing the problem does not mean we can solve it
        • Homonym / Synonym handling
        • Incomplete knowledge may cause constraint violations of clean instances
        • Current approach focuses on literal values
        • Scalability on large data sets
  19. Ongoing Extensions
        • Extension to a broader set of data quality problems
        • Enabling synonym handling and homonym tolerance
        • Enhancement of performance
        • Calculation of information quality scores
        • Integration of Linked Data as trusted reference for data quality management
        • Evaluation of the quality of popular Semantic Web data sets on instance level (e.g. Geonames & DBpedia)
        • Extension for (semi-)automated data cleansing
  20. Christian Fuerber, Researcher, E-Business & Web Science Research Group, Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany. skype: c.fuerber; email: christian@fuerber.com; web: http://www.unibw.de/ebusiness; homepage: http://www.fuerber.com. Paper is available at http://bit.ly/bYes0V
  21. References & Links
        LOD-Cloud: http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html
        D2RQ: http://www4.wiwiss.fu-berlin.de/bizer/d2rq/spec/
        SPIN: http://spinrdf.org/
        TopBraid Composer Free Edition: http://www.topquadrant.com/products/TB_Composer.html#free
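
The functional dependency template from slides 12 and 13 can be illustrated outside a SPARQL engine. The following is a minimal Python sketch, not the authors' implementation: it assumes a plain list of (subject, predicate, object) tuples in place of an RDF graph, and the subject names `loc:1` and `loc:2` are hypothetical. The argument names `arg1` to `arg4` follow the template on slide 13.

```python
# Toy triple store: (subject, predicate, object) tuples.
# This stands in for an RDF graph and is an assumption of the sketch.
TRIPLES = [
    ("loc:1", "vocab:location_CITY", "Las Vegas"),
    ("loc:1", "vocab:location_COUNTRY", "France"),
    ("loc:2", "vocab:location_CITY", "Las Vegas"),
    ("loc:2", "vocab:location_COUNTRY", "USA"),
]


def functional_dependency_violations(triples, arg1, arg2, arg3, arg4):
    """Report subjects that have (arg1, arg2) but lack (arg3, arg4).

    Mirrors the template pattern:
        ?this ?arg1 ?arg2 .
        FILTER (!spl:hasValue(?this, ?arg3, ?arg4)) .
    """
    facts = set(triples)
    # All subjects matching the condition side of the dependency.
    subjects = sorted({s for (s, p, o) in triples if p == arg1 and o == arg2})
    # One violation record per subject missing the required value.
    return [
        {"violationRoot": s, "violationPath": arg3}
        for s in subjects
        if (s, arg3, arg4) not in facts
    ]


violations = functional_dependency_violations(
    TRIPLES,
    "vocab:location_CITY", "Las Vegas",
    "vocab:location_COUNTRY", "USA",
)
print(violations)
```

As with the negated `spl:hasValue` filter, a subject that has no country value at all is also reported, because the check only asks whether the required value is present.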
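
The three ASK templates on slide 15 (missing literal values, lower limit, upper limit) can be sketched in the same way. This is an illustrative Python approximation under stated assumptions: triples are plain tuples, `this` is passed explicitly instead of being bound by SPIN, and the property `ex:height_cm` and the `item:` subjects are hypothetical examples that do not appear in the slides.

```python
# Toy triple store; ex:height_cm and the item: subjects are hypothetical.
TRIPLES = [
    ("loc:1", "vocab:location_ZIP", ""),
    ("loc:1", "vocab:location_CITY", "Las Vegas"),
    ("item:1", "ex:height_cm", 950),
    ("item:2", "ex:height_cm", 12),
]


def values(triples, this, arg1):
    """All objects of property arg1 on subject this."""
    return [o for (s, p, o) in triples if s == this and p == arg1]


def is_number(value):
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def missing_literal(triples, this, arg1):
    """Mirrors: ASK WHERE { ?this ?arg1 "" . }"""
    return "" in values(triples, this, arg1)


def below_lower_limit(triples, this, arg1, arg2):
    """Mirrors: ASK WHERE { ?this ?arg1 ?value . FILTER (?value < ?arg2) . }"""
    # Non-numeric values are skipped, similar to a FILTER that fails on a type error.
    return any(is_number(v) and v < arg2 for v in values(triples, this, arg1))


def above_upper_limit(triples, this, arg1, arg2):
    """Mirrors: ASK WHERE { ?this ?arg1 ?value . FILTER (?value > ?arg2) . }"""
    return any(is_number(v) and v > arg2 for v in values(triples, this, arg1))


print(missing_literal(TRIPLES, "loc:1", "vocab:location_ZIP"))
print(above_upper_limit(TRIPLES, "item:1", "ex:height_cm", 100))
print(below_lower_limit(TRIPLES, "item:2", "ex:height_cm", 1))
```

Each function returns True when the rule is violated for the given subject, which matches how the ASK templates are phrased: the query describes the problem, not the valid state.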
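
Slide 11's advice to define what is allowed and negate it is what the syntax violation template on slide 16 does with a regular expression. A minimal Python sketch follows, reusing the pattern from the slide with the standard `re` module; the tuple-based store and the subject name `loc:1` are assumptions of the sketch, and small differences between SPARQL and Python regex dialects are ignored.

```python
import re

# Pattern copied from slide 16: letters, commas, dots, and spaces only.
ALLOWED = re.compile(r"^([A-Za-z,. ])*$")

# Toy triple store standing in for an RDF graph (assumption).
TRIPLES = [
    ("loc:1", "vocab:location_STREET", "8489 Strong St."),
    ("loc:1", "vocab:location_CITY", "Las Vegas"),
]


def syntax_violation(triples, this, arg1):
    """Mirrors:
        ASK WHERE { ?this ?arg1 ?value .
                    FILTER (!regex(str(?value), "^([A-Za-z,. ])*$")) }
    The rule states the allowed syntax and negates the match.
    """
    return any(
        s == this and p == arg1 and not ALLOWED.search(str(o))
        for (s, p, o) in triples
    )


print(syntax_violation(TRIPLES, "loc:1", "vocab:location_STREET"))
print(syntax_violation(TRIPLES, "loc:1", "vocab:location_CITY"))
```

Whether digits in a street value count as a violation depends on which property the rule is attached to; the street value is used here only because it is the one value in the example record that the pattern rejects.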
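
The unique value violation template on slide 16 joins a property with itself and keeps pairs of distinct subjects that share a value. The Python sketch below approximates that by grouping subjects per value; it assumes a tuple-based store, and the subjects `loc:1` to `loc:3` with their ID values are hypothetical.

```python
from collections import defaultdict

# Toy triple store; subjects and ID values are hypothetical.
TRIPLES = [
    ("loc:1", "vocab:location_ID", 1),
    ("loc:2", "vocab:location_ID", 1),
    ("loc:3", "vocab:location_ID", 2),
]


def unique_value_violations(triples, arg1):
    """Report subjects whose arg1 value is shared with another subject.

    Mirrors:
        ?a ?arg1 ?uniqueValue .
        ?b ?arg1 ?uniqueValue .
        FILTER (?a != ?b)
    """
    owners = defaultdict(set)
    for s, p, o in triples:
        if p == arg1:
            owners[o].add(s)
    # A value held by more than one subject violates uniqueness for all of them.
    flagged = sorted(
        s for subjects in owners.values() if len(subjects) > 1 for s in subjects
    )
    return [{"violationRoot": s, "violationPath": arg1} for s in flagged]


violations = unique_value_violations(TRIPLES, "vocab:location_ID")
print(violations)
```

Grouping by value avoids comparing every pair of subjects explicitly, which is one way the scalability concern on slide 18 could be approached in a non-SPARQL setting.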