Data Exchange over RDF

       Andrés Letelier
   Advisor: Marcelo Arenas

Pontificia Universidad Católica de Chile


       September 1, 2011
What is data exchange?




   Problem
   Data under one schema S needs to be restructured and translated
   into a target schema T


                              S −→ T
                              I_S −→ I_T
Schema mappings



  Question
  Which source instances correspond to which target instances?

  Answer
  Schema mappings:

                 M ⊆ Instances(S) × Instances(T)

  Usually, schema mappings are defined as M = (S, T, Σ_ST), where
  Σ_ST is a set of dependencies between S and T
Definition (Solution)
I2 is a solution of I1 under M iff (I1, I2) ∈ M
The set of all solutions for I1 under M is denoted by Sol_M(I1)
Resource Description Framework (RDF)



      Data model for representing information about World Wide
      Web resources
      W3C Recommendation (1999)
      Part of the semantic web stack
      Directed, labeled graphs
      Blank nodes (labeled nulls)
      Basically, sets of triples (s, p, o)
Example
 D=   {
          (B1   name    paul)
          (B1   email   paul@example.edu)
          (B2   name    john)
          (B2   city    Liverpool)
                                            }
SPARQL (pronounced “sparkle”)


      Query language for RDF
      W3C Recommendation (2008)
      Standard for querying RDF datasets
      Returns sets of partial mappings
      Operators:
          Projection
          AND (inner join)
          OPT (left join)
          FILTER
          UNION
          and more
Example

          P1 = (?X, name, ?Y )

                       ?X    ?Y
          ⟦P1⟧_D  =    B1    paul
                       B2    john
Example

          P2 = (?X, name, ?Y ) AND (?X, email, ?Z)

                        ?X   ?Y     ?Z
           ⟦P2⟧_D  =    B1   paul   paul@example.edu
Example

          P3 = (?X, name, ?Y ) OPT (?X, email, ?Z)

                        ?X   ?Y     ?Z
           ⟦P3⟧_D  =    B1   paul   paul@example.edu
                        B2   john   (unbound)
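
The three examples above can be reproduced with a minimal sketch of the AND and OPT semantics. The encoding is an assumption made for illustration only: terms are plain strings, variables start with "?", a triple pattern is a 3-tuple, and a compound pattern is a tuple (op, left, right). It is not a SPARQL engine, just the join and left-join behaviour described on the slides.

```python
# Terms are plain strings; anything starting with "?" is a variable.
D = {
    ("B1", "name", "paul"),
    ("B1", "email", "paul@example.edu"),
    ("B2", "name", "john"),
    ("B2", "city", "Liverpool"),
}

def match_triple(pattern, graph):
    """Evaluate one triple pattern: one mapping per matching triple."""
    results = []
    for triple in graph:
        mu = {}
        for p, v in zip(pattern, triple):
            if p.startswith("?"):
                if mu.setdefault(p, v) != v:  # a repeated variable must agree
                    break
            elif p != v:                      # a constant must match exactly
                break
        else:
            results.append(mu)
    return results

def compatible(m1, m2):
    """Two mappings are compatible if they agree on shared variables."""
    return all(m2[x] == v for x, v in m1.items() if x in m2)

def evaluate(pattern, graph):
    """pattern is a triple pattern or a tuple (op, left, right), op in AND/OPT."""
    op = pattern[0]
    if op not in ("AND", "OPT"):
        return match_triple(pattern, graph)
    left = evaluate(pattern[1], graph)
    right = evaluate(pattern[2], graph)
    result = [{**m1, **m2} for m1 in left for m2 in right if compatible(m1, m2)]
    if op == "OPT":
        # Left join: keep left mappings with no compatible partner on the right.
        result += [m1 for m1 in left
                   if not any(compatible(m1, m2) for m2 in right)]
    # Set semantics: drop duplicate mappings.
    unique = {frozenset(m.items()): m for m in result}
    return list(unique.values())

P1 = ("?X", "name", "?Y")
P2 = ("AND", P1, ("?X", "email", "?Z"))
P3 = ("OPT", P1, ("?X", "email", "?Z"))
```

Calling evaluate(P3, D) keeps the mapping for B2 even though B2 has no email, which is exactly the difference between OPT and AND. The order of the returned mappings is not fixed, since D is a set.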
Well-designed SPARQL patterns


   Definition (Well-designed patterns)
   A pattern P is well-designed if for every subpattern P′ of P of the
   form P1 OPT P2, every variable that appears in P2 and outside P′
   also appears in P1.

   Example
       (?X, name, ?Y ) OPT ((?X, email, ?Z) OPT (?X, city, ?A))
       is well-designed
       (?X, name, ?Y ) OPT ((?W, email, ?Z) OPT (?X, city, ?A))
       is not
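
The definition can be checked mechanically. The following is a sketch under assumed conventions (triple patterns as 3-tuples of strings, compound patterns as tuples (op, left, right) with op equal to "AND" or "OPT"); it covers only the AND/OPT fragment used on these slides and ignores FILTER and UNION.

```python
def variables(pattern):
    """All variables mentioned in a pattern."""
    if pattern[0] in ("AND", "OPT"):
        return variables(pattern[1]) | variables(pattern[2])
    return {t for t in pattern if t.startswith("?")}

def well_designed(pattern, outside=frozenset()):
    """`outside` holds the variables of the whole pattern that occur
    outside the subpattern currently being inspected."""
    op = pattern[0]
    if op not in ("AND", "OPT"):
        return True  # a single triple pattern has no OPT subpattern
    left, right = pattern[1], pattern[2]
    if op == "OPT":
        # Variables of P2 that occur outside P1 OPT P2 but not in P1.
        leaked = (variables(right) & outside) - variables(left)
        if leaked:
            return False
    return (well_designed(left, outside | variables(right))
            and well_designed(right, outside | variables(left)))

good = ("OPT", ("?X", "name", "?Y"),
        ("OPT", ("?X", "email", "?Z"), ("?X", "city", "?A")))
bad = ("OPT", ("?X", "name", "?Y"),
       ("OPT", ("?W", "email", "?Z"), ("?X", "city", "?A")))
```

In the second pattern, ?X occurs in (?X, city, ?A) and outside the inner OPT, but not in (?W, email, ?Z), so the check fails there.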
Data Exchange over RDF




      S and T are fixed to be RDF triples
      Tuple generating dependencies have to be redefined
      But first, we need some definitions...
RDF Tuple Generating Dependencies



   Let P be a SPARQL pattern, µ1 and µ2 be partial mappings, and
   Ω1 and Ω2 be sets of mappings. Then:
       var(P ) are the variables mentioned in P
       dom(µ1 ) is the domain of µ1
       A SPARQL SELECT query (denoted by (W, P ), where
       W ⊆ var(P )) is the projection of the evaluation of P onto
       the variables in W
RDF Tuple Generating Dependencies



   Let P be a SPARQL pattern, µ1 and µ2 be partial mappings, and
   Ω1 and Ω2 be sets of mappings. Then:
       µ1 is subsumed by µ2 (µ1 ⊑ µ2) if:
           dom(µ1) ⊆ dom(µ2),
           for every ?X in dom(µ1) that is not bound to a blank node
           we have that µ1(?X) = µ2(?X), and
           for every pair of variables ?X and ?Y in dom(µ1) such that
           µ1(?X) = µ1(?Y) it is the case that µ2(?X) = µ2(?Y).
       Ω1 is subsumed by Ω2 (Ω1 ⊑ Ω2) if for every mapping µ1 in
       Ω1 there exists a mapping µ2 in Ω2 such that µ1 ⊑ µ2.
RDF Tuple Generating Dependencies



   (Re)Definition (Tuple Generating Dependencies)
   Let P1 and P2 be SPARQL patterns, and W ⊆ var(P1) ∩ var(P2).
   An RDF tgd is a sentence of the form

                            (W, P1 ) → (W, P2 )

   Given two RDF graphs G1 and G2 , and a set of tgds Σ,
   (G1 , G2 ) |= Σ if for every tgd (W, P1 ) → (W, P2 ) in Σ it is the
   case that ⟦(W, P1)⟧_G1 ⊑ ⟦(W, P2)⟧_G2, where ⟦Q⟧_G denotes the
   evaluation of the query Q over the graph G
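
A sketch of the satisfaction check (G1, G2) |= Σ. To keep it short it makes several assumptions that are not in the slides: patterns are restricted to conjunctions (AND) of triple patterns, written as lists of 3-tuples; each tgd is a triple (W, body, head); blank nodes are strings starting with "_:". The example uses W = {?Y, ?Z} so that W is contained in both var(P1) and var(P2).

```python
def evaluate(triples, graph):
    """Evaluate a conjunction (AND) of triple patterns; returns a list of mappings."""
    results = [{}]
    for pattern in triples:
        extended = []
        for mu in results:
            for triple in graph:
                nu = dict(mu)
                # Variables must agree with earlier bindings, constants must match.
                if all(nu.setdefault(p, v) == v if p.startswith("?") else p == v
                       for p, v in zip(pattern, triple)):
                    extended.append(nu)
        results = extended
    return results

def select(w, triples, graph):
    """The SELECT query (W, P): project each mapping onto the variables in W."""
    return [{x: m[x] for x in w if x in m} for m in evaluate(triples, graph)]

def subsumed(mu1, mu2):
    """Mapping subsumption, with blank nodes assumed to start with '_:'."""
    if not set(mu1) <= set(mu2):
        return False
    if any(not v.startswith("_:") and mu2[x] != v for x, v in mu1.items()):
        return False
    return all(mu2[x] == mu2[y]
               for x in mu1 for y in mu1 if mu1[x] == mu1[y])

def satisfies(g1, g2, sigma):
    """(G1, G2) |= Sigma: each body answer is subsumed by some head answer."""
    return all(
        all(any(subsumed(m1, m2) for m2 in select(w, head, g2))
            for m1 in select(w, body, g1))
        for w, body, head in sigma)

D = {("_:B1", "name", "paul"), ("_:B1", "email", "paul@example.edu"),
     ("_:B2", "name", "john"), ("_:B2", "city", "Liverpool")}
sigma = [({"?Y", "?Z"},
          [("?X", "name", "?Y"), ("?X", "email", "?Z")],
          [("?Y", "hasmail", "?Z")])]
solution1 = {("paul", "hasmail", "paul@example.edu")}
solution2 = solution1 | {("john", "hasmail", "_:n")}
```

Both solution1 and solution2 satisfy sigma with respect to D, while the empty graph does not, since the body produces an answer for paul that the head must cover.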
RDF Schema Mappings




  Since S and T are fixed,

                             M=Σ


                G2 ∈ Sol_M(G1) ←→ (G1, G2) |= Σ
Universal solutions

   Example
   Let W = {?Y, ?Z}, Σ =
   {(W, (?X, name, ?Y ) AND (?X, email, ?Z)) →
   (W, (?Y, hasmail, ?Z))}
   and consider the dataset D:

   Solution 1
    G2 =     {
                 (paul   hasmail   paul@example.edu)
                                                       }

   Solution 2
    G2 =     {
                 (paul   hasmail   paul@example.edu)
                 (john   hasmail    n)
                                                       }
Universal solutions




   Definition
   A solution G2 is universal if for every other solution G2′, G2 ⊑ G2′

       Solution 1 is universal
       Solution 2 is not
Universal solutions




   Not all settings have universal solutions:
   Consider G1 = {(1, 2, 3)}, W = {?X, ?Y } and

             Σ = {(W, (?X, ?Y, ?Z)) →
                   (W, ((?X, a, b) OPT (?W, b, ?Y ))
                    AND ((?X, c, d) OPT (?Z, d, ?Y )))}
Solution 1
 G2 =    {
             (1     a   b)
             (n1    b   2)
             (1     c   d)
                             }

Solution 2
 G2 =    {
             (1     a   b)
             (n2    d   2)
             (1     c   d)
                             }
This setting has no universal solution!
Good and bad news



  Bad news
  There is no guarantee that an exchange setting that has a solution
  will have a universal solution

  Good news
  If the heads of all tgds in Σ are well-designed and there is a
  solution, there is always a universal solution

  Better news
  We have an algorithm
“Chasing” SPARQL queries

 input  A mapping µ and a (well-designed) SPARQL pattern P
 output An RDF graph G such that µ ∈ ⟦P⟧_G

  Chase(µ, ν, P, G), by cases on P:
      P is a triple pattern t:
          add unbound variables in t as fresh blank nodes to ν
          add ν(t) to G
      P is P1 AND P2:
          Chase(µ, ν, P1, G)
          Chase(µ, ν, P2, G)
      P is P1 OPT P2:
          Chase(µ, ν, P1, G)
          if (dom(µ) \ dom(ν)) ∩ var(P2) ≠ ∅: Chase(µ, ν, P2, G)
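
One possible reading of the procedure above, as a runnable sketch. Several points are assumptions rather than something the slide states: ν starts empty and copies µ's bindings as variables are encountered; the OPT test is read as (dom(µ) \ dom(ν)) ∩ var(P2) ≠ ∅, i.e. the optional part is materialized only when µ still has variables waiting there; fresh blank nodes are named "_:b0", "_:b1", ...; patterns are encoded as 3-tuples and (op, left, right) tuples.

```python
import itertools

def variables(pattern):
    """All variables mentioned in a pattern."""
    if pattern[0] in ("AND", "OPT"):
        return variables(pattern[1]) | variables(pattern[2])
    return {t for t in pattern if t.startswith("?")}

def chase(mu, pattern):
    """Return (G, nu) for a mapping mu and an AND/OPT pattern."""
    nu, graph = {}, set()
    fresh = (f"_:b{i}" for i in itertools.count())  # fresh blank node labels
    _chase(mu, nu, pattern, graph, fresh)
    return graph, nu

def _chase(mu, nu, pattern, graph, fresh):
    op = pattern[0]
    if op == "AND":
        _chase(mu, nu, pattern[1], graph, fresh)
        _chase(mu, nu, pattern[2], graph, fresh)
    elif op == "OPT":
        _chase(mu, nu, pattern[1], graph, fresh)
        # Chase the optional part only if mu binds variables of P2
        # that nu has not covered yet.
        pending = (set(mu) - set(nu)) & variables(pattern[2])
        if pending:
            _chase(mu, nu, pattern[2], graph, fresh)
    else:
        # Triple pattern: reuse mu's binding, otherwise invent a blank node.
        for term in pattern:
            if term.startswith("?") and term not in nu:
                nu[term] = mu[term] if term in mu else next(fresh)
        graph.add(tuple(nu.get(term, term) for term in pattern))

head = ("OPT", ("?X", "1", "2"), ("?X", "?Y", "3"))
g_small, nu_small = chase({"?X": "1"}, head)
g_large, nu_large = chase({"?X": "1", "?Y": "2"}, head)
```

With µ = {?X → 1} only the mandatory triple (1, 1, 2) is produced; with µ = {?X → 1, ?Y → 2} the optional triple (1, 2, 3) is added as well. When chasing several mappings into one target graph, the blank node generator would have to be shared across calls so that labels do not collide.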
After chasing:




       µ ⊑ ν
       ν ∈ ⟦P⟧_G
       {µ} ⊑ ⟦P⟧_G
       If we chase every mapping in ⟦(W, P1)⟧_G1 with the corresponding
       head P2 in Heads(Σ), we get a universal solution.
Certain answers



   Definition (Certain answers on a regular data exchange setting)
   The set of certain answers is the intersection of the evaluation of
   the query over all the valid solutions

   Example
   Consider G1 = {(1, 2, 3)} and

         Σ = {({?X}, (?X, ?Y, ?Z)) →
                     ({?X}, (?X, 1, 2) OPT (?X, ?Y, 3))}
Solution 1
 G2 =   {
             (1   1   2)
                           }
 ⟦(W, P2)⟧_G2 = {{?X → 1}}

Solution 2
 G2′ =  {
             (1   1   2)
             (1   2   3)
                        }
 ⟦(W, P2)⟧_G2′ = {{?X → 1, ?Y → 2}}

  The intersection of ⟦(W, P2)⟧_G2 and ⟦(W, P2)⟧_G2′ is empty!
Certain answers


   Given a pattern P and a set of RDF graphs 𝒢, let Lower(P, 𝒢) be
   the set of all lower bounds of {⟦P⟧_G | G ∈ 𝒢} w.r.t. subsumption.
   (Re)Definition (Certain Answers)
   The set of certain answers of a set of RDF graphs 𝒢 and a SPARQL
   pattern P is defined as any set of mappings Ω in Lower(P, 𝒢) such
   that for any other Ω′ in Lower(P, 𝒢) it is the case that Ω′ ⊑ Ω.

   Claim
   All the possible sets of certain answers to an RDF data exchange
   setting are homomorphically equivalent.
Back in our previous example...



 Solution 1
  G2 =   {
              (1   1   2)
                            }
  ⟦(W, P2)⟧_G2 = {{?X → 1}}

 Solution 2
  G2′ =  {
              (1   1   2)
              (1   2   3)
                          }
  ⟦(W, P2)⟧_G2′ = {{?X → 1, ?Y → 2}}

   The set of certain answers is now {{?X → 1}}
In conclusion...




   Our contributions so far:
       RDF and SPARQL TGDs
       RDF Schema mappings
       Universal solutions
       Materialization of universal solutions
       Certain answers
In conclusion...




   To do:
       Prove remaining claims
       Query answering (using universal solutions)
       Incomplete information in the source instance
       Knowledge exchange over RDFs
Thank you for listening




   Any questions?
