A database approach to monitoring the quality of information in RDF stores



  1. 1. A DATABASE APPROACH TO MONITORING THE QUALITY OF INFORMATION IN RDF STORES Alexandre Rademaker and Edward Hermann. Wednesday, November 30, 11
  2. 2. NOTES This is not a research report; it is a research proposal! Let us start by looking at results from database researchers.
  3. 3. WHAT IS (AND HOW TO ENSURE) DATA QUALITY? Semantic properties of databases can be represented by integrity constraints! Integrity enforcement means maintaining the correctness of the database. Truth maintenance! Hendrik, 2011
  4. 4. HENDRIK DECKER http://web.iti.upv.es/~hendrik/ Universidad Politécnica de Valencia
  5. 5. EXAMPLE A marriage is between one man and one woman only. How can we model such a constraint in a relational DB? We are talking about more than check constraints, foreign keys and primary keys.
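One way to read the marriage example is as a pair of denial constraints over a `married(husband, wife)` relation. The following is only a minimal sketch under assumptions that are not in the slides: the relation layout, the `men` and `women` sets, and the violation labels are all hypothetical.

```python
def marriage_violations(married, men, women):
    """Check a set of (husband, wife) pairs against two denial constraints.

    Returns a list of (label, offending data) entries; an empty list
    means the toy database satisfies both constraints.
    """
    pairs = sorted(married)
    violations = []
    # Constraint 1: the first component is a man, the second a woman.
    for h, w in pairs:
        if h not in men or w not in women:
            violations.append(("not-man-and-woman", (h, w)))
    # Constraint 2: no person takes part in two distinct marriages.
    for i, (h1, w1) in enumerate(pairs):
        for h2, w2 in pairs[i + 1:]:
            if {h1, w1} & {h2, w2}:
                violations.append(("multiple-marriages", ((h1, w1), (h2, w2))))
    return violations


# Hypothetical sample data.
men = {"adam", "bob"}
women = {"carol", "dana"}
consistent = {("adam", "carol"), ("bob", "dana")}
inconsistent = {("adam", "carol"), ("adam", "dana")}
print(marriage_violations(consistent, men, women))
print(marriage_violations(inconsistent, men, women))
```

The second constraint relates two different rows of the same relation, which is why it goes beyond a per-row check constraint or a key declaration.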
  6. 6. DB THEORY USES DATALOG Datalog is more expressive than core SQL: it can express transitive closure. Core SQL is FOL (decidable for finite models). SELECT X WHERE Y (give me the bindings that satisfy the clauses)
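To make the transitive-closure remark concrete, here is a small sketch of naive bottom-up (fixpoint) evaluation of the usual two-rule Datalog program for reachability. The predicate names `edge` and `path` and the sample graph are assumptions chosen for illustration.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of the Datalog program

        path(X, Y) :- edge(X, Y).
        path(X, Z) :- path(X, Y), edge(Y, Z).

    The recursive rule is re-applied until no new fact is derived.
    """
    path = set(edges)
    while True:
        derived = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if derived <= path:  # fixpoint reached: nothing new was derived
            return path
        path |= derived


edges = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(transitive_closure(edges)))
```

The number of rounds depends on the data (the length of the longest path), which is the intuition behind why a single fixed first-order query cannot express it.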
  7. 7. TWO WAYS TO ENFORCE INTEGRITY (1) At each update, check whether any integrity constraint is violated. (In practice the check is not always rigorous, due to its performance penalty.) (2) Repair extant violations of constraints. (Accumulation of inconsistency is inevitable.) Hendrik, 2011
  8. 8. INCONSISTENCY-TOLERANT METHODS The rigorous way is to eliminate all inconsistency: repair the whole database. Relaxation: partial (flexible) repairs! Absolute consistency is out of the question due to its intractability! Hendrik, 2011
  9. 9. FLEXIBILITY OF PARTIAL INCONSISTENCY Flexibility helps in two ways: (1) Integrity enforcement is more flexible: it does not have to be done all at once (constraint violations can be tolerated and resolved at an appropriate moment). (2) Some inconsistency may be unknown at update time; a total approach would fail in such a situation. But... Hendrik, 2011
  10. 10. PARTIAL REPAIRS Absolute consistency is out of the question due to its intractability. But naive inconsistency-tolerant repairs can be data-destructive. A rational, flexible repair strategy needs criteria (expressed in terms of metrics). Only admit repairs that are integrity-preserving! That is, the total amount of integrity violation must not increase after the repair. Hendrik, 2011
  11. 11. FORMAL DEFINITIONS For an update $U$ (inserts, deletes) of a database $D$, we denote by $D^U$ the updated database. $D$ = database; $IC$ = integrity theory; $I$ = constraint; $U$ = update; $D(F) = true$ if $F$ evaluates to true in $D$; $D(I) = true$ if $I$ is satisfied in $D$; $D(IC) = true$ if every $I$ in $IC$ is satisfied in $D$. Hendrik, 2011
  12. 12. FORMAL DEFINITIONS Let $\preceq$ be an ordering that is antisymmetric, reflexive and transitive. For two elements $A$ and $B$ of a lattice, $A \oplus B$ is their least upper bound. Hendrik, 2011
  13. 13. FORMAL DEFINITIONS We say that $(\mu, \preceq)$ is an inconsistency metric if $\mu$ maps tuples $(D, IC)$ to some lattice that is partially ordered by $\preceq$. A simple example of a metric is given by $\beta(D, IC) = D(IC)$ with the natural order $true \prec false$ on the range of $\beta$. That is, integrity satisfaction, $D(IC) = true$, means lower inconsistency than integrity violation, $D(IC) = false$. Non-trivial examples are given by comparing or counting violated constraints. Hendrik, 2011
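The metrics mentioned on this slide can be sketched in a few lines. This is an illustrative sketch only: the representation of a database as a set of `(person, age)` facts, the constraint names, and all function names are assumptions, and constraints are treated as wholes rather than case by case.

```python
def violated(db, constraints):
    """Names of the constraints that do not hold in db."""
    return frozenset(name for name, holds in constraints.items() if not holds(db))


def binary_metric(db, constraints):
    """Coarse metric D(IC): True iff every constraint is satisfied."""
    return not violated(db, constraints)


def binary_leq(a, b):
    """Ordering on the range {True, False} in which True (satisfied) is
    the lower element and False (violated) the higher one."""
    return a or not b


def count_metric(db, constraints):
    """Finer metric: number of violated constraints, ordered by <= on integers."""
    return len(violated(db, constraints))


# Hypothetical toy database and integrity theory.
db = {("ann", -3), ("bob", 40), ("eve", 200)}
constraints = {
    "age-non-negative": lambda d: all(age >= 0 for _, age in d),
    "age-below-150": lambda d: all(age < 150 for _, age in d),
}
print(binary_metric(db, constraints), count_metric(db, constraints))
print(sorted(violated(db, constraints)))
```

The set returned by `violated` is itself a metric value: sets of constraint names ordered by inclusion form a lattice whose least upper bound is set union.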
  14. 14. INCONSISTENCY METRICS Inconsistency metrics are used to decide whether an update preserves integrity, that is, whether it avoids creating an integrity violation that did not exist before the update. Intuitively, an update preserves integrity if it does not increase the measured inconsistency. For a metric $(\mu, \preceq)$, an update $U$ of a database $D$ with integrity theory $IC$ is integrity-preserving with regard to $(\mu, \preceq)$ if $\mu(D^U, IC) \preceq \mu(D, IC)$. Hendrik, 2011
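The integrity-preservation test can be sketched as follows, using the set of violated constraints as the metric and set inclusion as the ordering. The toy database of `(person, age)` facts, the constraint names, and the function names are assumptions made for this sketch, not part of the cited work.

```python
def violated(db, constraints):
    """Metric value: the set of constraint names that do not hold in db."""
    return frozenset(name for name, holds in constraints.items() if not holds(db))


def apply_update(db, inserts=(), deletes=()):
    """Return the updated database: db minus the deletes, plus the inserts."""
    return (set(db) - set(deletes)) | set(inserts)


def preserves_integrity(db, constraints, inserts=(), deletes=()):
    """True iff the metric after the update is below or equal to the metric
    before it, with violated-constraint sets ordered by inclusion."""
    before = violated(db, constraints)
    after = violated(apply_update(db, inserts, deletes), constraints)
    return after <= before


# Hypothetical database that already violates one constraint.
db = {("ann", -3), ("bob", 40)}
constraints = {
    "age-non-negative": lambda d: all(age >= 0 for _, age in d),
    "age-below-150": lambda d: all(age < 150 for _, age in d),
}
print(preserves_integrity(db, constraints, inserts=[("carl", 30)]))
print(preserves_integrity(db, constraints, inserts=[("dora", 200)]))
print(preserves_integrity(db, constraints, deletes=[("ann", -3)]))
```

The first update is accepted even though the database stays inconsistent, which is the inconsistency-tolerant part; the second is rejected because it introduces a new violation; the third is a partial repair.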
  15. 15. AND MORE... Inconsistency-tolerant integrity checking; repairs; computing and checking partial repairs; computing integrity-preserving repairs. Hendrik, 2011
  16. 16. WHY ARE WE TALKING ABOUT IT?
  17. 17. WHY ARE WE TALKING ABOUT IT? Lattes@FGV Project (a unified KB of FGV research publications, researchers, skills, etc.), http://dck092.fgv.br/ The Semantic Web brings RDF, description logics, linked data, etc. Our research topics include logics and knowledge representation. RDF is the key concept of the Semantic Web. A relational database has a fixed model (the TBox of an ontology).
  18. 18. TOPOS: THEORETICAL PART Scratching the surface! A topos (plural topoi or toposes) is a category with a quite expressive internal logic. The category of graphs and graph homomorphisms can be viewed as a topos. This topos already has a Heyting algebra that is used as the truth basis of its internal logic. A Heyting algebra is a lattice with additional properties. This topos-theoretic view of RDF stores can be investigated as a natural foundation for partial repairs in RDF stores. Besides that, if we view traditional DBs as finite first-order logical structures, the category of (finite) first-order structures and homomorphisms between them has its own internal logic. This internal logic can also be investigated with regard to partial repairs.
  19. 19. LATTES@FGV
  20. 20. LATTES@FGV
  21. 21. LATTES@FGV
  22. 22. LATTES@FGV: THE RDF KB http://dck092.fgv.br:10035/repositories/fgv (800k triples)
  23. 23. LATTES@FGV 480 Lattes CVs plus data collected from other sources (Qualis, Digital Library, etc.) in one triple store. Lots of errors (inconsistencies), for different reasons: poor user interface for data input, misinterpretation, etc. How to identify the errors (in a non-ad-hoc way)? How to fix what can be fixed automatically?
  24. 24. INTEGRITY CONSTRAINTS IN RDF We can consider extending what was discussed so far to non-SQL stores. A KR/DB can be viewed as a graph. The query language of RDF-based stores, SPARQL, can be used to provide semantics to the store.
  25. 25. EXAMPLES An article referenced by a CV must have the author of this CV as one of its authors!
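This constraint can be checked by looking for counterexamples in the triple set. The sketch below works on plain Python tuples rather than a real RDF store; the predicates `ex:cvOf` and `ex:lists` are hypothetical names invented for illustration (only `dc:creator` appears in the slides), and the sample data is made up.

```python
def cv_author_violations(triples):
    """Return (cv, article) pairs where the owner of the CV is not among
    the dc:creator values of an article that the CV lists.

    A CV without a recorded owner is also reported for each listed article.
    """
    owner = {s: o for s, p, o in triples if p == "ex:cvOf"}
    creators = {}
    for s, p, o in triples:
        if p == "dc:creator":
            creators.setdefault(s, set()).add(o)
    return sorted(
        (s, o)
        for s, p, o in triples
        if p == "ex:lists" and owner.get(s) not in creators.get(o, set())
    )


# Hypothetical sample triples.
triples = {
    ("cv1", "ex:cvOf", "alice"),
    ("cv1", "ex:lists", "art1"),
    ("cv1", "ex:lists", "art2"),
    ("art1", "dc:creator", "alice"),
    ("art1", "dc:creator", "bob"),
    ("art2", "dc:creator", "bob"),
}
print(cv_author_violations(triples))
```

An empty result means the constraint holds; each returned pair is one violation that a repair step could target individually.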
  26. 26. EXAMPLES If two resources were identified by reference to the same article, every author of the first one should also be related to the second one!
  27. 27. IN THE LAST EXAMPLE Of course, two publications cannot be considered the same by comparing only their titles! We need entity alignment, a similarity checker... Suppose we have identified all resources that represent the same real “entity” using owl:sameAs, then ...
      ask {
        ?p1 owl:sameAs ?p2 ;
            dc:creator ?c .
        OPTIONAL { ?p2 ?rel ?c . }
        FILTER( !bound(?rel) )
      }
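The logic of the ASK query (an `OPTIONAL` pattern followed by `FILTER(!bound(...))`, i.e. negation as failure) can be mirrored over an in-memory triple set. This is only a sketch: it treats `owl:sameAs` as ordinary directed triples without any symmetric or transitive inference, and the function name and sample data are assumptions.

```python
def same_as_author_mismatches(triples):
    """Return (p1, p2, c) where p1 owl:sameAs p2, c is a dc:creator of p1,
    and p2 is not related to c by any predicate at all."""
    related = {(s, o) for s, p, o in triples}
    creators = [(s, o) for s, p, o in triples if p == "dc:creator"]
    return sorted(
        (p1, p2, c)
        for p1, p, p2 in triples
        if p == "owl:sameAs"
        for s, c in creators
        if s == p1 and (p2, c) not in related
    )


# Hypothetical sample triples.
triples = {
    ("a1", "owl:sameAs", "a2"),
    ("a1", "dc:creator", "alice"),
    ("a1", "dc:creator", "bob"),
    ("a2", "dc:creator", "alice"),
}
mismatches = same_as_author_mismatches(triples)
print(bool(mismatches), mismatches)
```

The truth value of the list corresponds to the ASK answer (true means the constraint is violated), while the list itself identifies which triples a partial repair would need to touch.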
  28. 28. A LITTLE BIT ABOUT THE IDENTIFICATION OF SIMILARITY
      ;; Assert owl:sameAs for every pair in LIST; a pair whose first
      ;; element is not a blank node is reversed before being asserted.
      (defun assert-same-list (list)
        (let ((new nil))
          (mapcar (lambda (pair)
                    (let ((a (first pair))
                          (b (second pair)))
                      (if (not (blank-node-p a))
                          (push (reverse pair) new)
                          (push pair new))))
                  list)
          (dolist (pair new)
            (add-triple (first pair) !owl:sameAs (second pair)))))

      ;; Pair up all foaf:Agent resources that share a name.
      ;; The callback insert-same-as is not defined on the slide.
      (select0/callback (?x ?y) #'insert-same-as
        (q- ?x !rdf:type !foaf:Agent)
        (q- ?y !rdf:type !foaf:Agent)
        (q- ?x !foaf:name ?n)
        (q- ?y !foaf:name ?n)
        (lispp (upi< ?x ?y)))
      Naive approach: shaking hands!
  29. 29. A LITTLE BIT ABOUT THE IDENTIFICATION OF SIMILARITY
      ;; Collect the ego group of each still-unvisited vertex.
      (defun components (vertices n generator)
        (do ((res nil)
             (vtx vertices (set-difference vtx (car res) :test #'upi=)))
            ((null vtx) res)
          (push (ego-group (car vtx) n generator) res)))

      ;; Journals with the same valid ISSN and similar titles.
      (defsna-generator same-journal (node)
        (select0 (?j)
          (q- (?? node) !bibo:issn ?i)
          (q- ?j !bibo:issn ?i)
          (lispp (utils::check-issn (part->value ?i)))
          (lispp (upi< node ?j))
          (q- ?j !dc:title ?t2)
          (q- (?? node) !dc:title ?t1)
          (lispp (> (utils::jaro-winkler-distance (part->value ?t1)
                                                  (part->value ?t2))
                    0.7))))

      ;; Merge the nodes of each component.
      (let ((nodes (mapcar #'subject (get-triples-list :p !bibo:issn :limit nil))))
        (dolist (g (components nodes 2 'same-journal))
          (merge-nodes g)))
      An ad-hoc solution: breadth-first search of connected components!
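For readers unfamiliar with the idea behind this slide, here is a generic sketch of finding connected components by breadth-first search. It is not a translation of the Lisp code above: it explores each component fully instead of to a fixed depth, it assumes an undirected "similar to" relation supplied as a `neighbors` function, and the journal identifiers are made up.

```python
from collections import deque


def connected_components(vertices, neighbors):
    """Group vertices into connected components using breadth-first search.

    neighbors(v) must return an iterable of the vertices adjacent to v.
    """
    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        component = [start]
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors(node):
                if nxt not in seen:
                    seen.add(nxt)
                    component.append(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Hypothetical similarity graph between journal resources.
similar = {"j1": ["j2"], "j2": ["j1", "j3"], "j3": ["j2"], "j4": []}
groups = connected_components(["j1", "j2", "j3", "j4"], lambda v: similar[v])
print(groups)
```

Each returned group is a candidate set of resources to merge; note that similarity is not transitive in general, so merging whole components can join items that are only indirectly similar.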