Query Processing
         Using Structure Index for RDF Data on the Web
         Thanh Tran and Günter Ladwig
         Institute AIFB, Karlsruhe Institute of Technology
         ducthanh.tran@kit.edu, guenter.ladwig@kit.edu




    KIT – die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH)
Agenda

     - Problem Introduction
     - Approach
         - Structure Index for RDF Data
         - Structure-based Partitioning
         - Structure-aware Query Processing
     - Evaluation
     - Conclusion




RDF data
     [Figure: example RDF data graph with resource vertices 0-9 and the values KIT and MIT, connected by AuthorOf, Supervises, WorksAt, and Name edges]

     - Consists of triples <s,p,o>
     - Triples form a graph, where vertices denote resources and their values, connected
       by directed labelled edges representing properties (i.e.,relations and attributes)
     - URIs are used as labels of edges and vertices representing resources
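
The triple model above can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the triples are read off the example graph (one further Supervises edge in the figure cannot be recovered from this transcript and is left out), and the names `TRIPLES` and `build_graph` are hypothetical.

```python
from collections import defaultdict

# Triples <s, p, o> read off the example graph. One more Supervises edge
# exists in the figure but is not recoverable here, so it is omitted.
TRIPLES = [
    (3, "AuthorOf", 0), (2, "AuthorOf", 0), (4, "AuthorOf", 0),
    (7, "AuthorOf", 1), (6, "AuthorOf", 1),
    (3, "Supervises", 2), (3, "Supervises", 4), (7, "Supervises", 6),
    (3, "WorksAt", 8), (2, "WorksAt", 8), (4, "WorksAt", 8),
    (7, "WorksAt", 9), (6, "WorksAt", 9),
    (8, "Name", "KIT"), (9, "Name", "MIT"),
]


def build_graph(triples):
    """Index the directed, labelled edges by their subject vertex."""
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((p, o))
    return dict(out_edges)


graph = build_graph(TRIPLES)
print(graph[8])  # outgoing edges of vertex 8
```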
Conjunctive Queries
     [Figure: query graph over the variables x, y, z, u and the constant KIT, with edges labelled Supervises, WorksAt, and Name]

     - Important fragment of widely used languages (SQL, SPARQL)
     - Consisting of triple patterns p(s,o) where p is a predicate and s and o are variables
       or constants
     - Distinguished variables, e.g. x, vs. undistinguished variables
     - Triple patterns constitute a query graph
Conjunctive Query Answering
     [Figure: the query graph (variables x, y, z, u; constant KIT) shown next to the example data graph (vertices 0-9, KIT, MIT)]

     - Graph pattern matching problem: a match of a query q on a graph G is a mapping h
       from the variables of q to vertices of G such that the substitution of variables in
       the graph-representation of q would yield a subgraph of G
     - A match h is a homomorphism from the “query graph” to the data graph
     - Query answering based on two basic operations: data loading and join
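
A minimal sketch of query answering as graph pattern matching, assuming a naive nested-loop join rather than the sorted merge joins used by real stores. The query is a simplified one built from the edge labels visible in the query figure; its exact shape, and the names `answer` and `match_pattern`, are assumptions made for illustration.

```python
# Data triples (s, p, o) taken from the running example.
TRIPLES = [
    (3, "Supervises", 2), (3, "Supervises", 4), (7, "Supervises", 6),
    (3, "WorksAt", 8), (2, "WorksAt", 8), (4, "WorksAt", 8),
    (7, "WorksAt", 9), (6, "WorksAt", 9),
    (8, "Name", "KIT"), (9, "Name", "MIT"),
]

# Triple patterns p(s, o); terms starting with "?" are variables.
QUERY = [("Supervises", "?x", "?y"), ("WorksAt", "?y", "?u"), ("Name", "?u", "KIT")]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` maps onto `triple`, or return None."""
    p, s, o = pattern
    ts, tp, to = triple
    if p != tp:
        return None
    new = dict(binding)
    for term, value in ((s, ts), (o, to)):
        if is_var(term):
            if new.setdefault(term, value) != value:
                return None  # variable already bound to another vertex
        elif term != value:
            return None      # constant does not match
    return new


def answer(query, triples):
    """Load the triples for each pattern and join them on shared variables."""
    bindings = [{}]
    for pattern in query:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match_pattern(pattern, t, b)) is not None]
    return bindings


matches = answer(QUERY, TRIPLES)
```

Each resulting binding is a mapping h from query variables to data vertices whose substitution yields a subgraph of the data graph.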

State-of-the-art
     - Data Partitioning
         - Vertical partitioning (SW-Store)
     - Indexing
         - Sextuple indexing (Hexastore)
         - Materialization and indexing of entire join paths (GRIN)
     - Index Implementation
         - B+ tree
         - Inverted index (Semplore)
         - Index compression (RDF-3X)
     - Query processing
         - Sorted merge join based on vertical partitioning and indexing (SW-Store)
         - Join order optimization based on dynamic programming (RDF-3X)
     → A combination of different concepts makes up the state-of-the-art!




Large Volume of RDF Data on the Web




    - ~10 billion RDF triples (2009)
    - Interlinked by ~10 million mappings (2009)
    - Besides linked data, there are standalone ontologies, RDFa, etc.
Semi-structured RDF data on the Web
     [Figure: the example data graph (vertices 0-9, KIT, MIT) above a schema graph with the vertices Publication, PhD Student, Institute, Post Doc, and String and the edges AuthorOf, Supervises, WorksAt, and Name]

     - An RDF graph often contains both data and schema information
     - Resources are linked to a class (rdfs:Class) via rdf:type
     - Schema information is often incomplete, especially for Web data and RDFa data
     → RDF data might be schema-less, semi-structured data
Overview of Our Approach

     Problems
        • Management of possibly semi-structured RDF data on the Web
        • Scalability and efficiency of RDF Web data query processing


     Contributions
        • Parameterized structure index for RDF data
        • Structure-based partitioning (SP)
        • Structure-aware query processing

     Benefits
        • Reduction of unions & joins as well as IO cost



Structure Index for RDF data on the Web
     [Figure: structure index graph with the extensions B1 = {3, 7}, B2 = {0, 1}, B3 = {8, 9}, B4 = {2, 4, 6}, B5 = {KIT, MIT}, and B6 = {5}, linked by AuthorOf, WorksAt, Supervises, and Name edges, shown next to the example data graph]

     - Structure index is a graph
         - A structural description that is more fine-grained than a schema
         - Consists of classes (extensions) and relations between them
         - Resources in an extension exhibit the same structure, i.e., cannot be distinguished by
           outgoing (forward bisimilarity) and incoming (backward bisimilarity) “edge trees”
         - Bisimulation is parameterized by two sets of edge labels
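
The grouping into extensions can be sketched as a naive partition refinement. This is not the authors' algorithm, only an illustration under stated assumptions: the edge (6, Supervises, 5) is assumed because the transcript does not preserve which vertex supervises 5, and the function names and the two label-set parameters `fwd_labels` and `bwd_labels` are hypothetical. With these assumptions, considering only incoming edges reproduces the six extensions shown in the figure.

```python
TRIPLES = [
    (3, "AuthorOf", 0), (2, "AuthorOf", 0), (4, "AuthorOf", 0),
    (7, "AuthorOf", 1), (6, "AuthorOf", 1),
    (3, "Supervises", 2), (3, "Supervises", 4), (7, "Supervises", 6),
    (6, "Supervises", 5),  # ASSUMPTION: not recoverable from the slides
    (3, "WorksAt", 8), (2, "WorksAt", 8), (4, "WorksAt", 8),
    (7, "WorksAt", 9), (6, "WorksAt", 9),
    (8, "Name", "KIT"), (9, "Name", "MIT"),
]


def structure_index(triples, fwd_labels, bwd_labels):
    """Refine a partition of the vertices until it is stable.

    Two vertices stay in the same block only if they have the same set of
    (direction, label, neighbour block) signatures for the chosen labels.
    """
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    block = {v: 0 for v in nodes}
    num_blocks = 1
    while True:
        sig = {v: set() for v in nodes}
        for s, p, o in triples:
            if p in fwd_labels:
                sig[s].add(("out", p, block[o]))
            if p in bwd_labels:
                sig[o].add(("in", p, block[s]))
        ids, refined = {}, {}
        for v in nodes:
            key = (block[v], frozenset(sig[v]))
            refined[v] = ids.setdefault(key, len(ids))
        if len(ids) == num_blocks:  # no block was split: stable
            return refined
        block, num_blocks = refined, len(ids)


def extensions(block):
    """Turn a vertex-to-block map into a set of extensions."""
    groups = {}
    for v, b in block.items():
        groups.setdefault(b, set()).add(v)
    return {frozenset(g) for g in groups.values()}


ALL_LABELS = {"AuthorOf", "Supervises", "WorksAt", "Name"}
index = extensions(structure_index(TRIPLES, fwd_labels=set(), bwd_labels=ALL_LABELS))
```

Adding labels to either set can only split extensions further, which is how the two parameters trade index size against structural precision.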


Structure-based Partitioning
     [Figure: structure index graph with the extensions B1 = {3, 7}, B2 = {0, 1}, B3 = {8, 9}, B4 = {2, 4, 6}, B5 = {KIT, MIT}, and B6 = {5}]

     SP B4 table
         Sub   Property   Obj
         2     AuthorOf   0
         4     AuthorOf   0
         6     AuthorOf   1
         2     WorksAt    8
         4     WorksAt    8
         6     WorksAt    9

     VP AuthorOf table
         Sub   Obj
         2     0
         4     0
         6     1
         3     0
         7     1



     - Whether a graph vertex instantiates a variable of a query depends on its
       structure → vertices are physically grouped based on structural similarity
     - Apply the grouping captured by the structure index to the physical organization
         - Create a physical group for every vertex of the structure index (i.e., every extension)
         - Triples are in the same group when their subjects belong to the same extension
     - Triples of an SP table not only satisfy the property of a triple pattern but also
       provide some structural guarantee, e.g., match the entire query structure
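
The difference between the two physical layouts can be sketched as follows. The sketch uses only the AuthorOf and WorksAt triples listed in the slide's tables and hard-codes the extension of each subject from the figure; the function names are hypothetical.

```python
from collections import defaultdict

# Extension of each subject, as given in the structure index figure.
EXTENSION = {3: "B1", 7: "B1", 2: "B4", 4: "B4", 6: "B4"}

TRIPLES = [
    (2, "AuthorOf", 0), (4, "AuthorOf", 0), (6, "AuthorOf", 1),
    (3, "AuthorOf", 0), (7, "AuthorOf", 1),
    (2, "WorksAt", 8), (4, "WorksAt", 8), (6, "WorksAt", 9),
    (3, "WorksAt", 8), (7, "WorksAt", 9),
]


def structure_partition(triples, extension):
    """SP: one table per extension, holding the triples of its subjects."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[extension[s]].append((s, p, o))
    return dict(tables)


def vertical_partition(triples):
    """VP: one two-column (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


sp = structure_partition(TRIPLES, EXTENSION)
vp = vertical_partition(TRIPLES)
```

Here `sp["B4"]` holds the six rows of the SP B4 table and `vp["AuthorOf"]` the five rows of the VP AuthorOf table; a pattern such as AuthorOf(y, z) with y matched to B4 needs to read only the former.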
Structure-aware Query Processing

      Proposition 1
          - A mapping of q into G exists only if a mapping of q into the
            associated index graph G’ also exists.
          - The resulting extensions that match the nodes in q will
            contain all data graph matches.




      Two-step query processing
          - Index graph: find extensions Ei matching q
          - Data graph: combine data elements retrieved for Ei

Index Graph Matching
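     [Figure: the query variables x, y, z, u and the constant KIT placed on the index graph vertices B1-B6, with edges AuthorOf, WorksAt, Supervises, and Name. Two index graph matches are shown: h1 = {B1, B2, B3, B4, B5} and h2 = {B2, B3, B4, B5, B6}]
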
      - Retrieve index graph edges matching query edges (triple patterns)
      - Join index graph edges along query edges
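
The two steps above can be sketched by lifting the data triples to index graph edges and joining those. Several things here are assumptions: the query is a reconstruction from the figure residue (edge directions are assumed), one Supervises edge of the data graph is not recoverable and is omitted (so only the match h1 appears), and all function names are hypothetical.

```python
TRIPLES = [
    (3, "AuthorOf", 0), (2, "AuthorOf", 0), (4, "AuthorOf", 0),
    (7, "AuthorOf", 1), (6, "AuthorOf", 1),
    (3, "Supervises", 2), (3, "Supervises", 4), (7, "Supervises", 6),
    (3, "WorksAt", 8), (2, "WorksAt", 8), (4, "WorksAt", 8),
    (7, "WorksAt", 9), (6, "WorksAt", 9),
    (8, "Name", "KIT"), (9, "Name", "MIT"),
]
# Extensions as given in the structure index figure.
EXT = {3: "B1", 7: "B1", 0: "B2", 1: "B2", 8: "B3", 9: "B3",
       2: "B4", 4: "B4", 6: "B4", "KIT": "B5", "MIT": "B5", 5: "B6"}
# ASSUMED query: triple patterns p(s, o), variables start with "?".
QUERY = [("Supervises", "?x", "?y"), ("WorksAt", "?x", "?u"),
         ("WorksAt", "?y", "?u"), ("Name", "?u", "KIT"),
         ("AuthorOf", "?y", "?z")]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def index_graph(triples, ext):
    """An index edge (Bi, p, Bj) exists if some triple links Bi to Bj via p."""
    return {(ext[s], p, ext[o]) for s, p, o in triples}


def bind(pattern, edge, binding):
    p, s, o = pattern
    if p != edge[1]:
        return None
    new = dict(binding)
    for term, value in ((s, edge[0]), (o, edge[2])):
        if is_var(term):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def match_index(query, triples, ext):
    """Retrieve index edges per triple pattern and join them along the query."""
    edges = index_graph(triples, ext)
    # Constants are replaced by the extension that contains them.
    lifted = [(p, s if is_var(s) else ext[s], o if is_var(o) else ext[o])
              for p, s, o in query]
    bindings = [{}]
    for pattern in lifted:
        bindings = [b2 for b in bindings for e in edges
                    if (b2 := bind(pattern, e, b)) is not None]
    return bindings


index_matches = match_index(QUERY, TRIPLES, EXT)
```

Together with B5 for the constant KIT, the single result corresponds to h1 = {B1, B2, B3, B4, B5}.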
Query Pruning

      Proposition 2
          - If a query is tree-shaped, and consists only of
            undistinguished variables (besides the root), matches on
            the structure index contain all and only data graph
            matches.


      - Data elements contained in the extensions matching the
        query root node represent all and only final query answers
      - Given such queries, no further processing is needed
      - Given more general queries, tree-shaped query parts can be
        pruned away
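
The syntactic side of pruning can be sketched as repeatedly dropping triple patterns that hang off the query through an undistinguished variable occurring nowhere else. This is a simplified illustration, not the authors' procedure: whether a given part may safely be dropped depends on what the index guarantees, and the query, the choice of x as the only distinguished variable, and the function name are assumptions.

```python
from collections import Counter

# ASSUMED query: triple patterns p(s, o), variables start with "?".
QUERY = [("Supervises", "?x", "?y"), ("WorksAt", "?x", "?u"),
         ("WorksAt", "?y", "?u"), ("Name", "?u", "KIT"),
         ("AuthorOf", "?y", "?z")]


def is_var(term):
    return term.startswith("?")


def prune_leaves(query, distinguished):
    """Split the query into (remaining, pruned) triple patterns.

    A pattern is pruned when one of its terms is an undistinguished
    variable that occurs in no other place in the remaining query.
    """
    remaining, pruned = list(query), []
    changed = True
    while changed:
        changed = False
        counts = Counter(t for _, s, o in remaining for t in (s, o) if is_var(t))
        for pattern in remaining:
            _, s, o = pattern
            if any(is_var(t) and t not in distinguished and counts[t] == 1
                   for t in (s, o)):
                remaining.remove(pattern)
                pruned.append(pattern)
                changed = True
                break  # recount variable occurrences before continuing
    return remaining, pruned


remaining, pruned = prune_leaves(QUERY, distinguished={"?x"})
```

For this query only AuthorOf(y, z) is removed; the other four patterns form a cycle through x, y, and u and still have to be matched on the data graph.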
Query Pruning
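     [Figure: index graph match h1 = {B1, B2, B3, B4, B5}, with the query variables x, y, z, u and the constant KIT placed on the index graph vertices B1-B6 and the edges AuthorOf, WorksAt, Supervises, and Name]
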
      - Elements in extensions are known to satisfy query structure
      - Elements in B4 are already known to be authors of some z
      - No further data processing is needed for this part
Data Graph Matching
     [Figure: triples retrieved from the extensions of the index graph match]
         B1: 3 WorksAt 8, 7 WorksAt 9, 3 Supervises 2, 3 Supervises 4, 7 Supervises 6, ...
         B3: 8 Name KIT, 9 Name MIT
         B4: 2 WorksAt 8, 4 WorksAt 8, 6 WorksAt 9, ...

     Resulting data graph match:
         h’1 = {3 WorksAt 8, 3 Supervises 2, 2 WorksAt 8, 8 Name KIT}


       - Retrieve triples from matching extensions & join along query edges
       - Match class processing: group index graph matches to match classes to
         avoid processing matches that partially overlap
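
The first bullet can be sketched as follows, given one index graph match. The query is the assumed one from the figure residue with the AuthorOf part already pruned, `H1` is the variable assignment read off h1, and the function names are hypothetical. Match class processing is not covered by this sketch.

```python
from collections import defaultdict

TRIPLES = [
    (3, "AuthorOf", 0), (2, "AuthorOf", 0), (4, "AuthorOf", 0),
    (7, "AuthorOf", 1), (6, "AuthorOf", 1),
    (3, "Supervises", 2), (3, "Supervises", 4), (7, "Supervises", 6),
    (3, "WorksAt", 8), (2, "WorksAt", 8), (4, "WorksAt", 8),
    (7, "WorksAt", 9), (6, "WorksAt", 9),
    (8, "Name", "KIT"), (9, "Name", "MIT"),
]
EXT = {3: "B1", 7: "B1", 0: "B2", 1: "B2", 8: "B3", 9: "B3",
       2: "B4", 4: "B4", 6: "B4", "KIT": "B5", "MIT": "B5", 5: "B6"}
H1 = {"?x": "B1", "?y": "B4", "?u": "B3"}  # index graph match (assumed roles)
QUERY = [("Supervises", "?x", "?y"), ("WorksAt", "?x", "?u"),
         ("WorksAt", "?y", "?u"), ("Name", "?u", "KIT")]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def sp_tables(triples, ext):
    """Group triples by the extension of their subject (SP tables)."""
    tables = defaultdict(list)
    for t in triples:
        tables[ext[t[0]]].append(t)
    return tables


def load(pattern, h, tables, ext):
    """Read only the SP table of the matched subject extension."""
    p, s, o = pattern
    src = h[s] if is_var(s) else ext[s]
    dst = h[o] if is_var(o) else ext[o]
    return [t for t in tables[src] if t[1] == p and ext[t[2]] == dst]


def bind(pattern, triple, binding):
    _, s, o = pattern
    new = dict(binding)
    for term, value in ((s, triple[0]), (o, triple[2])):
        if is_var(term):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def data_matches(query, h, triples, ext):
    """Load triples per pattern from the matched extensions, then join."""
    tables = sp_tables(triples, ext)
    bindings = [{}]
    for pattern in query:
        loaded = load(pattern, h, tables, ext)
        bindings = [b2 for b in bindings for t in loaded
                    if (b2 := bind(pattern, t, b)) is not None]
    return bindings


results = data_matches(QUERY, H1, TRIPLES, EXT)
```

The binding x = 3, y = 2, u = 8 corresponds to the match h’1 = {3 WorksAt 8, 3 Supervises 2, 2 WorksAt 8, 8 Name KIT} shown on the slide.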
Evaluation

      - DBLP and several synthetic datasets created using the
        Lehigh University Benchmark (LUBM)
      - 30 queries categorized into five classes

      Single-atom query (QDBLP1)
          SELECT ?x
            type (x, Person)

      Entity query (QLUBM9)
          SELECT ?x ?m
            emailAddress (x, fp@edu)
            res.Interest (x, research24)
            telephone (x, xxx-xxx-xxxx)

      Star query (QDBLP12)
          SELECT ?x, ?n
            type (x, Person)
            name (x, n)
            editor (y, x)
            author (z, x)
            cites (u, z)

      Path query (QLUBM6)
          SELECT ?x ?y
            takesCourse (x, y)
            teacherOf (z, y)
            type (z, FullProfessor)

      Graph-shaped query (QLUBM15)
          SELECT ?x ?a
            teacherOf (FullProfessor5, y)
            takesCourse (x, y)
            publicationAuthor (b, x)
            name (b, Publication7)
            memberOf (x, z)
            memberOf (a, z)
            advisor (x, a)
            telephone (a, xxx-xxx-xxxx)




Evaluation – Performance
      [Figure, left: total time in ms on DBLP for q1-q15 and their mean, SP vs. VP, on a logarithmic axis from 0.1 to 100000. Right: time of separate steps in ms (idx match, load(VP-SP), join(VP-SP)) and number of removed (pruned) query nodes for q1-q15.]


      - Compare our work (SP) against vertical partitioning (VP) [Abadi et al.]
          - Total query processing times
          - Times of the individual steps involved
      - SP is slightly slower for simple queries (q1-q3)
      - SP is 8-9 times faster for complex queries (q4-q15)
      - With more complex queries, the overhead incurred by answer space
        matching can be outweighed by the accumulated gain for load and join
Conclusions

      - A structure index that can deal with general graph-structured
        RDF data on the Web
      - The structure index can be leveraged for dealing with
        semi-structured data on the Web
      - The structure index can be used for RDF data
        partitioning & query processing, allowing complex
        queries to be processed many times faster
      - Future work
          - Adopt existing concepts in XML data management for
            structure index optimization & updates
          - Query optimization for structure-aware query processing
Thank you for your attention!



                                        Structure Index for RDF Data on the Web
                                        Duc Thanh Tran, AIFB Institute, KIT
                                        E-Mail: ducthanh.tran@kit.edu
                                        Web: http://sites.google.com/site/kimducthanh




State-of-the-art
      - Data Partitioning
          - Big table (old versions of Oracle, Jena, Sesame)
          - Property tables (Jena)
          - Vertical partitioning (SW-Store)
      - Indexing
          - Multiple indexing (YARS)
          - Sextuple indexing (Hexastore)
          - Materialization and indexing of entire join paths (GRIN)
      - Index Implementation
          - B+ tree
          - Inverted index (Semplore)
          - Index compression (RDF-3X)
      - Query processing
          - Sorted merge join based on vertical partitioning and indexing (SW-Store)
          - Join order optimization based on dynamic programming (RDF-3X)
      → A combination of different concepts makes up the state-of-the-art!
Overview of Our Approach
     Problems
         • Management of possibly semi-structured RDF data on the Web
         • Scalability and efficiency of RDF Web data query processing

     Contributions
         • Parameterized structure index for RDF data
         • Structure-based partitioning (SP): triples with same structure are grouped
         • Structure-aware query processing
           • Use the structure index to focus on data that satisfy the overall query structure
           • Then retrieve the data from the corresponding structure-based partitioned tables

     Benefits
         • Target data partitioning & query processing, i.e., complementary to other concepts
         • Reduction of unions & joins as well as IO cost

Evaluation – Scalability
      [Figure, left: query times in ms (0 to 25000) of OSQP and SQP on DBLP, LUBM1, LUBM5, LUBM10, and LUBM50. Right: processing times in ms (0 to 10000) on LUBM1, LUBM5, LUBM10, LUBM20, and LUBM50 for the series VPQP-SQP, SQP idx match, load (VPQP-SQP), and join (VPQP-SQP).]



      - Measured the average query performance for LUBM with varying size
      - Times increase with the size of the data
      - Gain for load and join increases in larger proportion than the overhead
        incurred for index match
          - Match performance is determined by the size of the index graph
          - Size depends on structure but not on the size of the data graph
          - Match time does not necessarily increase when the data becomes larger
          - Positive effect of data filtering (IO reduction) and query pruning (load and
            join) correlates with the data size
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 

Featured

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellSaba Software
 

Featured (20)

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
 

Query Processing Using Structure Index for RDF Data on the Web

  • 1. Query Processing Using Structure Index for RDF Data on the Web
    Thanh Tran and Günter Ladwig
    Institute AIFB, Karlsruhe Institute of Technology
    ducthanh.tran@kit.edu, guenter.ladwig@kit.edu
    KIT – die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH)
  • 2. Agenda
    - Problem Introduction
    - Approach
      - Structure Index for RDF Data
      - Structure-based Partitioning
      - Structure-aware Query Processing
    - Evaluation
    - Conclusion
  • 3. RDF data
    [Figure: example RDF data graph with resource vertices 0 to 9 and the values KIT and MIT, connected by edges labelled AuthorOf, Supervises, WorksAt and Name]
    - Consists of triples <s,p,o>
    - Triples form a graph, where vertices denote resources and their values, connected by directed labelled edges representing properties (i.e., relations and attributes)
    - URIs are used as labels of edges and vertices representing resources
  • 4. Conjunctive Queries
    [Figure: query graph over the variables x, y, z, u and the constant KIT, with edges labelled Supervises, WorksAt and Name]
    - Important fragment of widely used languages (SQL, SPARQL)
    - Consist of triple patterns p(s,o), where p is a predicate and s and o are variables or constants
    - Distinguished variables, e.g. x, vs. undistinguished variables
    - Triple patterns constitute a query graph
  • 5. Conjunctive Query Answering
    [Figure: the query graph of slide 4 next to the data graph of slide 3]
    - Graph pattern matching problem: a match of a query q on a graph G is a mapping h from the variables of q to vertices of G such that substituting the variables in the graph representation of q yields a subgraph of G
    - A match h is a homomorphism from the “query graph” to the data graph
    - Query answering is based on two basic operations: data loading and join
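The matching problem on slide 5 can be illustrated with a minimal Python sketch. Two things are assumptions rather than statements from the slides: the data graph is the set of triples that can be read off the tables and figures of slides 11 and 16 (vertex 5 is left out because its edges cannot be recovered from the transcript), and the example query is read as Supervises(y,x), WorksAt(y,u), WorksAt(x,u), Name(u,KIT), AuthorOf(x,z) with x distinguished. The helper names `bind` and `match` are hypothetical.

```python
# Data graph as (subject, property, object) triples.
# Assumed reading of the slide figures; vertex 5 is omitted.
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]

# Triple patterns p(s, o). Terms starting with "?" are variables,
# everything else is a constant. Assumed reading of the query figure.
QUERY = [
    ("Supervises", "?y", "?x"),
    ("WorksAt", "?y", "?u"),
    ("WorksAt", "?x", "?u"),
    ("Name", "?u", "KIT"),
    ("AuthorOf", "?x", "?z"),
]


def bind(binding, term, value):
    """Bind a variable to a vertex, or check a constant; False on conflict."""
    if term.startswith("?"):
        return binding.setdefault(term, value) == value
    return term == value


def match(patterns, triples):
    """Return all mappings h from query variables to vertices such that
    every substituted triple pattern is a triple of the data graph."""
    results = [{}]
    for p, s, o in patterns:
        extended = []
        for binding in results:
            for ts, tp, to in triples:          # data loading
                candidate = dict(binding)
                if tp == p and bind(candidate, s, ts) and bind(candidate, o, to):
                    extended.append(candidate)  # join with earlier patterns
        results = extended
    return results


matches = match(QUERY, TRIPLES)
# Only the distinguished variable x is reported as the answer.
answers = sorted({h["?x"] for h in matches})
print(answers)
```

This nested-loop evaluation is only meant to make the homomorphism definition concrete; it makes no attempt at the join ordering or indexing that real RDF stores rely on.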
  • 6. State-of-the-art
    - Data Partitioning
      - Vertical partitioning (SW-Store)
    - Indexing
      - Sextuple indexing (Hexastore)
      - Materialization and indexing of entire join paths (GRIN)
    - Index Implementation
      - B+ tree
      - Inverted index (Semplore)
      - Index compression (RDF-3X)
    - Query processing
      - Sorted merge join based on vertical partitioning and indexing (SW-Store)
      - Join order optimization based on dynamic programming (RDF-3X)
    - A combination of different concepts makes up the state-of-the-art!
  • 7. Large Volume of RDF Data on the Web
    - ~10 billion RDF triples (2009)
    - Interlinked by ~10 million mappings (2009)
    - Besides linked data, there are standalone ontologies, RDFa, etc.
  • 8. Semi-structured RDF data on the Web
    [Figure: the data graph of slide 3 together with a schema graph over the classes Publication, PhD Student, Post Doc, Institute and String, connected by AuthorOf, Supervises, WorksAt and Name edges]
    - RDF graph often contains both data and schema information
    - Resources are linked with an rdf:class via rdf:type
    - Schema information is incomplete, especially for Web data and RDFa data
    - → RDF data might be schema-less, semi-structured data
  • 9. Overview of Our Approach
    - Problems
      - Management of possibly semi-structured RDF data on the Web
      - Scalability and efficiency of RDF Web data query processing
    - Contributions
      - Parameterized structure index for RDF data
      - Structure-based partitioning (SP)
      - Structure-aware query processing
    - Benefits
      - Reduction of unions & joins as well as IO cost
  • 10. Structure Index for RDF data on the Web
    [Figure: the data graph and its structure index with the extensions B1 = {3, 7}, B2 = {0, 1}, B3 = {8, 9}, B4 = {2, 4, 6}, B5 = {KIT, MIT} and B6 = {5}, connected by AuthorOf, WorksAt, Supervises and Name edges]
    - Structure index is a graph
    - Is a structural description more fine-grained than a schema
    - Consists of classes (extensions) and relations between them
    - Resources in an extension exhibit the same structure, i.e., cannot be distinguished by outgoing (forward bisimilarity) and incoming (backward bisimilarity) “edge trees”
    - Parameterize bisimulation by two sets of edge labels
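How such extensions can be computed is sketched below with a naive partition refinement: vertices start in one block and are split until all vertices of a block have the same labelled outgoing and incoming edges into the same blocks. This is a hedged illustration, not the authors' algorithm. The triples are an assumed reading of slides 11 and 16 (vertex 5 is omitted, so no B6 appears), and the `forward` and `backward` label sets are one possible interpretation of the parameterization mentioned on the slide. All function names are hypothetical.

```python
from collections import defaultdict

# Assumed reading of the slide figures; vertex 5 is omitted.
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]


def structure_index(triples, forward=None, backward=None):
    """Map each vertex to a block id by forward/backward bisimulation.

    forward / backward restrict which edge labels are considered for
    outgoing / incoming edges; None means all labels (an assumption
    about how the parameterization could work)."""
    vertices = sorted({v for s, _, o in triples for v in (s, o)})
    block = {v: 0 for v in vertices}
    while True:
        signature = {v: set() for v in vertices}
        for s, p, o in triples:
            if forward is None or p in forward:
                signature[s].add(("out", p, block[o]))
            if backward is None or p in backward:
                signature[o].add(("in", p, block[s]))
        ids, refined = {}, {}
        for v in vertices:
            # Keeping the old block in the key makes each round a refinement.
            key = (block[v], frozenset(signature[v]))
            refined[v] = ids.setdefault(key, len(ids))
        if len(ids) == len(set(block.values())):  # no block was split
            return refined
        block = refined


def extensions(block):
    """Group vertices by block id."""
    groups = defaultdict(set)
    for v, b in block.items():
        groups[b].add(v)
    return {frozenset(g) for g in groups.values()}


block = structure_index(TRIPLES)
# Index graph: one edge between two blocks for every label connecting them.
index_edges = {(block[s], p, block[o]) for s, p, o in TRIPLES}
for ext in sorted(extensions(block), key=sorted):
    print(sorted(ext))
```

Passing smaller label sets yields a coarser index; with both sets empty, all vertices collapse into a single extension.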
  • 11. Structure-based Partitioning
    [Figure: the structure index of slide 10 with two example tables]

    SP B4 table
    Sub | Property | Obj
    2   | AuthorOf | 0
    4   | AuthorOf | 0
    6   | AuthorOf | 1
    2   | WorksAt  | 8
    4   | WorksAt  | 8
    6   | WorksAt  | 9

    VP AuthorOf table
    Sub | Obj
    2   | 0
    4   | 0
    6   | 1
    3   | 0
    7   | 1

    - Whether a graph vertex instantiates a variable of a query depends on its structure → vertices are physically grouped based on structural similarity
    - Apply the grouping captured by the structure index to the physical organization
    - Create a physical group for every vertex of the structure index
    - Triples are in the same group when their subjects belong to the same extension
    - Triples of an SP table not only satisfy the property of a triple pattern but also provide some structural guarantee, e.g., match the entire query structure
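The difference between the two layouts on slide 11 can be sketched in a few lines of Python. The triples and the vertex-to-extension assignment are an assumed reading of slides 10, 11 and 16 (vertex 5 and extension B6 are omitted), and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed reading of the slide figures; vertex 5 is omitted.
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]

# Extension of every vertex, taken from the structure index on slide 10.
EXTENSION_OF = {
    "3": "B1", "7": "B1", "0": "B2", "1": "B2", "8": "B3", "9": "B3",
    "2": "B4", "4": "B4", "6": "B4", "KIT": "B5", "MIT": "B5",
}


def vertical_partitioning(triples):
    """VP: one two-column (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def structure_based_partitioning(triples, extension_of):
    """SP: one three-column table per extension, keyed by the subject's extension."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[extension_of[s]].append((s, p, o))
    return dict(tables)


vp = vertical_partitioning(TRIPLES)
sp = structure_based_partitioning(TRIPLES, EXTENSION_OF)
print(vp["AuthorOf"])  # mixes authors from B1 and B4
print(sp["B4"])        # only triples whose subject is in B4
```

The VP AuthorOf table mixes subjects from B1 and B4, whereas the SP B4 table contains only triples of vertices that share the same structure.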
  • 12. Structure-aware Query Processing
    - Proposition 1
      - A mapping of q into G exists only if one also exists into the associated index graph G’.
      - The resulting extensions that match the nodes in q will contain all data graph matches.
    - Two-step query processing
      - Index graph: find extensions Ei matching q
      - Data graph: combine data elements retrieved for Ei
  • 13. Index Graph Matching
    [Figure: the query graph over x, y, z, u and KIT mapped onto the index graph, giving the matches h1 = {B1, B2, B3, B4, B5} and h2 = {B2, B3, B4, B5, B6}]
    - Retrieve index graph edges matching query edges (triple patterns)
    - Join index graph edges along query edges
  • 14. Query Pruning
    - Proposition 2
      - If a query is tree-shaped and consists only of undistinguished variables (besides the root), matches on the structure index contain all and only data graph matches.
      - Data elements contained in the extensions matching the query root node represent all and only final query answers.
    - Given such queries, no further processing is needed
    - Given more general queries, tree-shaped query parts can be pruned away
  • 15. Query Pruning
    [Figure: the index graph match h1 = {B1, B2, B3, B4, B5} with the AuthorOf edge from x to z pruned]
    - Elements in extensions are known to satisfy the query structure
    - Elements in B4 are already known to be authors of some z
    - No further data processing is needed for this part
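Proposition 2 can be checked on the running example with a small Python sketch. The tree-shaped query Supervises(x,y), AuthorOf(y,z) with root x is a hypothetical example chosen for illustration; the triples, extensions and index edges are an assumed reading of slides 10, 11 and 16 (vertex 5 and B6 are omitted). The sketch answers the query from the index graph alone and compares the result with a data-level evaluation.

```python
# Assumed reading of the slide figures; vertex 5 and B6 are omitted.
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
EXTENSIONS = {
    "B1": {"3", "7"}, "B2": {"0", "1"}, "B3": {"8", "9"},
    "B4": {"2", "4", "6"}, "B5": {"KIT", "MIT"},
}
INDEX_EDGES = [
    ("B1", "AuthorOf", "B2"), ("B4", "AuthorOf", "B2"),
    ("B1", "Supervises", "B4"),
    ("B1", "WorksAt", "B3"), ("B4", "WorksAt", "B3"),
    ("B3", "Name", "B5"),
]

# Hypothetical tree-shaped query: x is the root and the only
# distinguished variable; y and z are undistinguished.
TREE_QUERY = [("Supervises", "?x", "?y"), ("AuthorOf", "?y", "?z")]


def bind(binding, term, value):
    """Bind a variable, or check a constant; False on conflict."""
    if term.startswith("?"):
        return binding.setdefault(term, value) == value
    return term == value


def match(patterns, edges):
    """Match triple patterns p(s, o) against (s, p, o) edges."""
    results = [{}]
    for p, s, o in patterns:
        extended = []
        for binding in results:
            for es, ep, eo in edges:
                candidate = dict(binding)
                if ep == p and bind(candidate, s, es) and bind(candidate, o, eo):
                    extended.append(candidate)
        results = extended
    return results


# Answer from the index graph only: union of the extensions matched to x.
index_matches = match(TREE_QUERY, INDEX_EDGES)
answers_from_index = set().union(*(EXTENSIONS[h["?x"]] for h in index_matches))

# Reference answer computed on the data graph.
answers_from_data = {h["?x"] for h in match(TREE_QUERY, TRIPLES)}
print(sorted(answers_from_index), sorted(answers_from_data))
```

On this toy data both evaluations agree, so the data-level loads and joins for the tree-shaped part could be skipped.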
  • 16. Data Graph Matching
    [Figure: triples retrieved for the index graph match, e.g. 3 WorksAt 8, 7 WorksAt 9, 8 Name KIT, 9 Name MIT, 3 Supervises 2, 3 Supervises 4, 7 Supervises 6, 2 WorksAt 8, 4 WorksAt 8, 6 WorksAt 9, and one resulting data graph match h’1 = {3 WorksAt 8, 3 Supervises 2, 2 WorksAt 8, 8 Name KIT}]
    - Retrieve triples from matching extensions & join along query edges
    - Match class processing: group index graph matches into match classes to avoid processing matches that partially overlap
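The two processing steps of slides 12 to 16 can be put together in a minimal Python sketch: the query is first matched against the index graph, and the data-level join then loads, per triple pattern, only triples whose subject and object lie in the matched extensions. The data, extensions, index edges and query are assumed readings of the slide figures (vertex 5 and B6 are omitted), all names are hypothetical, and match classes and pruning are not modelled.

```python
# Assumed reading of the slide figures; vertex 5 and B6 are omitted.
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
EXTENSIONS = {
    "B1": {"3", "7"}, "B2": {"0", "1"}, "B3": {"8", "9"},
    "B4": {"2", "4", "6"}, "B5": {"KIT", "MIT"},
}
BLOCK_OF = {v: b for b, ext in EXTENSIONS.items() for v in ext}
INDEX_EDGES = [
    ("B1", "AuthorOf", "B2"), ("B4", "AuthorOf", "B2"),
    ("B1", "Supervises", "B4"),
    ("B1", "WorksAt", "B3"), ("B4", "WorksAt", "B3"),
    ("B3", "Name", "B5"),
]
QUERY = [
    ("Supervises", "?y", "?x"), ("WorksAt", "?y", "?u"),
    ("WorksAt", "?x", "?u"), ("Name", "?u", "KIT"),
    ("AuthorOf", "?x", "?z"),
]


def is_var(term):
    return term.startswith("?")


def bind(binding, term, value):
    """Bind a variable, or check a constant; False on conflict."""
    if is_var(term):
        return binding.setdefault(term, value) == value
    return term == value


def match(plan):
    """plan: list of (pattern, candidate edges); joins along shared variables."""
    results = [{}]
    for (p, s, o), edges in plan:
        extended = []
        for binding in results:
            for es, ep, eo in edges:
                candidate = dict(binding)
                if ep == p and bind(candidate, s, es) and bind(candidate, o, eo):
                    extended.append(candidate)
        results = extended
    return results


def to_block(term):
    """Constants are replaced by their extension for index-level matching."""
    return term if is_var(term) else BLOCK_OF[term]


# Step 1: match the query against the (small) index graph.
index_plan = [((p, to_block(s), to_block(o)), INDEX_EDGES) for p, s, o in QUERY]
index_matches = match(index_plan)


def data_plan(h):
    """Step 2 input: per pattern, only triples inside the matched extensions."""
    plan = []
    for p, s, o in QUERY:
        s_ext = EXTENSIONS[h[s] if is_var(s) else BLOCK_OF[s]]
        o_ext = EXTENSIONS[h[o] if is_var(o) else BLOCK_OF[o]]
        edges = [t for t in TRIPLES if t[1] == p and t[0] in s_ext and t[2] in o_ext]
        plan.append(((p, s, o), edges))
    return plan


answers, loaded = set(), 0
for h in index_matches:
    plan = data_plan(h)
    loaded += sum(len(edges) for _, edges in plan)
    answers |= {m["?x"] for m in match(plan)}

# Without the index, every pattern would load its full property table.
naive_loaded = sum(len([t for t in TRIPLES if t[1] == p]) for p, _, _ in QUERY)
print(sorted(answers), loaded, naive_loaded)
```

On this toy graph the single index match corresponds to h1 on slide 13, and the match with x = 2 corresponds to h’1 on slide 16.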
  • 17. Evaluation
    - DBLP and several synthetic datasets created using the Lehigh University Benchmark (LUBM)
    - 30 queries categorized into five classes
    [Figure: example queries QDBLP1, QDBLP12, QLUBM6, QLUBM9 and QLUBM15 illustrating the five classes: single-atom query, entity query, star query, path query and graph-shaped query]
  • 18. Evaluation – Performance
    [Figure: two log-scale bar charts for queries q1 to q15 on DBLP. Left: total time in ms for SP and VP, including the mean. Right: time of separate steps in ms (idx match, load(VP-SP), join(VP-SP)) and number of pruned query nodes]
    - Compare our work (SP) against vertical partitioning (VP) [Abadi et al.]
    - Total query processing times
    - Times of individual steps involved
    - Slightly slower w.r.t. simple queries (1-3)
    - SP 8-9 times faster w.r.t. complex queries (4-15)
    - With more complex queries, the overhead incurred by answer space matching can be outweighed by the accumulated gain for load and join
  • 19. Conclusions
    - Structure index that can deal with general graph-structured RDF data on the Web
    - Structure index can be leveraged for dealing with semi-structured data on the Web
    - Structure index can be used for RDF data partitioning & query processing, allowing complex queries to be processed many times faster
    - Future work
      - Adopt existing concepts in XML data management for structure index optimization & updates
      - Query optimization for structure-aware query processing
  • 20. Thank you for your attention!
    Structure Index for RDF Data on the Web
    Duc Thanh Tran, AIFB Institute, KIT
    E-Mail: ducthanh.tran@kit.edu
    Web: http://sites.google.com/site/kimducthanh
  • 21. State-of-the-art
    - Data Partitioning
      - Big table (old versions of Oracle, Jena, Sesame)
      - Property tables (Jena)
      - Vertical partitioning (SW-Store)
    - Indexing
      - Multiple indexing (YARS)
      - Sextuple indexing (Hexastore)
      - Materialization and indexing of entire join paths (GRIN)
    - Index Implementation
      - B+ tree
      - Inverted index (Semplore)
      - Index compression (RDF-3X)
    - Query processing
      - Sorted merge join based on vertical partitioning and indexing (SW-Store)
      - Join order optimization based on dynamic programming (RDF-3X)
    - A combination of different concepts makes up the state-of-the-art!
  • 22. Overview of Our Approach
    - Problems
      - Management of possibly semi-structured RDF data on the Web
      - Scalability and efficiency of RDF Web data query processing
    - Contributions
      - Parameterized structure index for RDF data
      - Structure-based partitioning (SP): triples with the same structure are grouped
      - Structure-aware query processing
        - Use the structure index to focus on data that satisfy the overall query structure
        - Then retrieve data from the corresponding structure-based partitioned tables
    - Benefits
      - Targets data partitioning & query processing, i.e., is complementary to other concepts
      - Reduction of unions & joins as well as IO cost
  • 23. Evaluation – Scalability
    [Figure: two charts with the axes "Processing Times [ms]" and "Query Times (ms)" over the datasets DBLP and LUBM1 to LUBM50; legend entries include SQP idx match, load (VPQP-SQP) and join (VPQP-SQP)]
    - Measured the average query performance for LUBM with varying size
    - Times increase with the size of the data
    - Gain for load and join increases in larger proportion than the overhead incurred for index match
    - Match performance is determined by the size of the index graph
    - Size depends on structure but not on the size of the data graph
    - Match time does not necessarily increase when the data becomes larger
    - Positive effect of data filtering (IO reduction) and query pruning (load and join) correlates with the data size

Editor's Notes

  1. In recent years, the amount of structured data available on the Web has been increasing rapidly, especially RDF data consisting of triples of the form ⟨s, p, o⟩, where s is the subject, p is a property, and o is the object. Such triples form a data graph G(V, L, E), where the vertices V denote resources and their values, which are connected by directed edges E, each endowed with a label from a label set L. One example is shown in Fig. 1.
  2. This development of a data web opens new ways for addressing complex information needs. Search is no longer limited to matching keywords against documents; instead, structured queries can be processed against web resources. In this regard, conjunctive queries represent an important fragment of widely used languages (SQL, SPARQL), which has been a focus of recent work on RDF data management [1, 10, 6]. Essentially, a query of this type consists of a set of triple patterns of the form p(s, o), where p is a predicate and s and o are variables (Var_q) or constants (Con_q). These conjunctive queries have high practical relevance because they are capable of expressing a large portion of relational queries. The vast majority of query languages used in practice fall into this fragment, including large parts of SQL and SPARQL, the standard language for querying RDF. Intuitively speaking, variables appearing in the SELECT clause are called distinguished variables (Var_q^d), and all others undistinguished variables (Var_q^u). Triple patterns constitute a query graph, as illustrated in Fig. 2b.
  3. A match of a conjunctive query q on a graph G is a mapping h from the variables of q to vertices of G such that the according substitution of variables in the graph representation of q would yield a subgraph of G. Therefore, a query match h can be interpreted as a certain type of homomorphism (i.e. a structure-preserving mapping) from the “query graph” to the data graph. Because the amount of data is enormous and rapidly increasing, scalability of this graph pattern matching on the data web has become a key issue. Search complexity increases substantially with the size of the graph.
  4. Data organization & indexing determine the efficiency of data loading; the efficiency of join depends on the join implementation and join order optimization.

     State-of-the-art: For this problem of matching a query graph pattern against the data graph, there are RDF stores, which retrieve data for every triple pattern and join it along the query edges. While the efficiency of retrieval depends on the physical data organization and indexing, the efficiency of join is largely determined by the join implementation and join order optimization strategies. We discuss these performance drivers that distinguish existing RDF stores:

     Data Partitioning: Different schemes have been proposed to govern the ways data is physically organized and stored. A basic scheme is the triple-based organization, where one big three-column table is used to store all triples. To avoid the many self-joins on the giant table, property-based partitioning is suggested [2], where data is stored in several “property tables”, each containing triples of one particular type of entities. Vertical partitioning (VP) has been proposed to decompose the data graph into n two-column tables, where n is the number of properties [1]. As this scheme allows entries to be sorted, fast merge joins can be performed.

     Indexing Scheme: With multiple indexing, several indexes are created for supporting different lookup patterns. The scheme with the widest coverage of access patterns is used in YARS [3], where six indexes are proposed to cover 16 possible access patterns of quads (triple patterns plus one additional context element). In [4], sextuple indexing has been suggested, which generalizes the strategy in [3] such that for different access patterns, retrieved data comes in a sorted fashion. In fact, this work extends VP with the idea of multiple indexing to support fast merge joins on different and more complex query patterns. Thus, this indexing technique goes beyond triple lookup operations to support fast joins. Along this line, entire join paths have been materialized and indexed using suffix arrays [5]. A different path index, coined GRIN and based on judiciously chosen “center nodes”, has been proposed in [6].

     Index Implementation: The B+-tree is most commonly used in current RDF stores. Recently, the inverted index typically used for IR tasks has been recognized as a viable choice for indexing large amounts of web data. It has been proposed to manage RDF data [7] and dataspaces [8]. Also, index compression techniques for RDF have been discussed [9].

     Query Processing & Optimization: Executing joins during query processing can be greatly accelerated when the retrieved triples are already sorted. Through VP, retrieved data comes in sorted fashion, enabling fast merge joins [1]. This join implementation has near linear complexity, resulting in best performance. Sextuple indexing takes this further to allow this join processing to be applied on many more query patterns, e.g. when the query contains unbound predicates such that p is a variable [4]. Further efficiency gains can be achieved by finding an optimal query plan [9], which leverages dynamic programming that also involves bushy plans.

     It has been reported that there is no single system [10], but rather a combination of different concepts that makes up the state-of-the-art in RDF data management. In particular, VP [1] is the candidate for physical data organization, multiple indexes [3] enable fast lookup, and optimized query plans [9] result in fast performance for complex join processing.
  5. We elaborate on concepts that improve the state-of-the-art in data partitioning and query processing:

     - Parameterized Structure Index for RDF Data: Generalizing work on XML data such as the dataguide [5], we propose an index called PIG that summarizes the structure of general graph-structured data like RDF. The size of this index can be controlled by means of parameters (e.g. derived from the workload).
     - Structure-based Partitioning: Based on PIG, we propose a structure-based partitioning scheme, where triples about elements with the same structure are physically grouped. The aim is to obtain contiguous storage of data that is likely to co-occur in query answers.
     - Structure-aware Query Processing: We propose to match the query against the structure index first, which is typically much smaller than the data graph (cf. the examples in Fig. 2). This helps to focus on data that satisfies the overall structure of the query and, on this basis, to proceed with standard processing at the level of the data for only certain parts of the query.

     Our solution is complementary to the concepts for indexing and query optimization [10, 8], and offers the following additional benefits:

     - Reduction of I/O Costs: We do not simply retrieve all data that matches some given triple patterns, but focus on the data that satisfies the entire query structure.
     - Reduction of Unions and Joins: These operations are needed only for some parts of the query. In the extreme case where no structure index matches can be found, we can skip data access and joins at the data level completely.

     In a benchmark against the state-of-the-art techniques for data partitioning and query processing used in SW-Store [1], our approach is 7-8 times faster for a PIG that is parameterized according to the query workload.

     Outline: We introduce PIG in Section 2. Partitioning, query processing and parameterization are discussed in Sections 3, 4 and 5. Experiments along with results are discussed in Section 6, before we review related work in Section 7 and conclude in Section 8. For more details, we refer interested readers to our technical report [2].
  6. PIG is a special graph forming a compact representation of the data graph, whose vertices stand for groups of data graph elements that have a similar or equal structural “neighborhood”. We capture the concept of equal structural neighborhood by the well-known notion of bisimulation, originating from the theoretical analysis of state-based dynamic systems. We adapt this notion to capture both directions of edges. We consider graph nodes $v_1, v_2$ as bisimilar (written: $v_1 \sim v_2$), ...

     First, a bisimulation for $L_1$ and $L_2$ is calculated, using an adapted version of the algorithm for determining the coarsest stable refinement of a partitioning [9]. The algorithm starts with a partition consisting of a single block that contains all data, and splits it into smaller blocks until the partition is a forward bisimulation. In order to perform both backward and forward bisimulation for only the parameters $L_1$ and $L_2$, we exploit the observation that $L_1$-forward-$L_2$-backward bisimulation on a data graph $G = (V, L, E)$ coincides with forward bisimulation on an altered data graph $G_{L_1L_2} = (V, L_1 \cup \{l^- \mid l \in L_2\}, E_{L_1L_2})$, where $E_{L_1L_2} = \{l(x, y) \mid l(x, y) \in E, l \in L_1\} \cup \{l^-(y, x) \mid l(x, y) \in E, l \in L_2\}$. After the bisimulation has been determined, the resulting blocks of the partition P are used to form the vertices of the index graph according to Definition 1.
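The construction above can be sketched in a few lines. Note the assumptions: the refinement loop below is a naive signature-based refinement, swapped in for the more efficient coarsest-stable-refinement algorithm the slide cites, and all function names, the `^-` suffix for inverted labels, and the toy graph are hypothetical.

```python
def altered_edges(edges, l1, l2):
    """Edges of the altered graph: keep l(x, y) for l in L1, add l^-(y, x) for l in L2."""
    out = set()
    for x, label, y in edges:
        if label in l1:
            out.add((x, label, y))
        if label in l2:
            out.add((y, label + "^-", x))
    return out


def coarsest_forward_bisimulation(vertices, edges):
    """Naive refinement: split blocks until outgoing-edge signatures are stable."""
    block = {v: 0 for v in vertices}  # start with a single block
    while True:
        succ = {v: set() for v in vertices}
        for x, label, y in edges:
            succ[x].add((label, block[y]))
        # The old block is part of the signature, so blocks can only be split.
        signature = {v: (block[v], frozenset(succ[v])) for v in vertices}
        ids = {}
        new_block = {}
        for v in sorted(vertices):
            new_block[v] = ids.setdefault(signature[v], len(ids))
        if len(ids) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block


def extensions(block):
    """Group vertices by block id and return the blocks as sorted lists."""
    groups = {}
    for v, b in block.items():
        groups.setdefault(b, set()).add(v)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data graph.
edges = [
    ("s1", "Supervises", "a1"),
    ("s1", "Supervises", "a2"),
    ("s2", "Supervises", "a3"),
    ("a1", "WorksAt", "u1"),
    ("a2", "WorksAt", "u1"),
    ("a3", "WorksAt", "u2"),
    ("a4", "WorksAt", "u2"),
]
vertices = {v for x, _, y in edges for v in (x, y)}
block = coarsest_forward_bisimulation(
    vertices, altered_edges(edges, l1={"WorksAt"}, l2={"Supervises"})
)
parts = extensions(block)
```

With this parameterization, supervised authors end up in one block, the unsupervised `a4` in another, and all remaining vertices in a third, since they have no outgoing edges in the altered graph. Choosing smaller label sets therefore yields a coarser and smaller index.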
  7. Whether a graph vertex instantiates a variable of a query depends on its structural properties, i.e. its incoming and outgoing edges and paths. Therefore, if nodes are physically grouped together based on structural similarity, a group would contain more candidates for variable instantiations. Thus, we apply structure-based partitioning (SP) to the data graph by creating a physical group (e.g. a table) for every vertex of the index graph, i.e. one group for every extension. Every group contains the triples that “describe” elements contained in the corresponding extension. That is, triples are in the same group when they contain the same properties and their subjects belong to the same extension.

     Recall that extensions represent partitions of the data graph. Thus, grouping triples based on extensions guarantees an exhaustive and redundancy-free decomposition of the data graph.

     Compared to VP, where triples with the same property are grouped together, SP groups triples that are similar in structure. Using VP tables, triples retrieved from disk match the property of a single triple pattern. However, whether such a triple is also relevant for the entire query (i.e., contributes to the final results) depends on its structure. Since SP tables contain only triples that are similar in structure, a table identified as relevant for a query is likely to contain relatively more relevant triples. In fact, triples of an SP table retrieved for a given query satisfy not only the property of a triple pattern of that query but also the entire query structure. Thus, with SP we can focus on relevant data. In effect, it reduces the amount of irrelevant data that might have to be retrieved from disk when using VP, and thus can reduce I/O costs.
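As a minimal sketch of this grouping rule, assuming that the extension of every subject is already known (the `extension_of` mapping and the block names below are hypothetical), triples can be keyed by the pair of subject extension and property:

```python
from collections import defaultdict


def structure_based_partition(triples, extension_of):
    """Group triples by (extension of the subject, property)."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[(extension_of[s], p)].append((s, p, o))
    return dict(groups)


# Hypothetical toy data; the extensions are assumed to come from the structure index.
triples = [
    ("s1", "Supervises", "a1"),
    ("s1", "Supervises", "a2"),
    ("a1", "WorksAt", "u1"),
    ("a2", "WorksAt", "u1"),
    ("a4", "WorksAt", "u2"),
]
extension_of = {"s1": "B1", "a1": "B2", "a2": "B2", "a4": "B3"}
sp_groups = structure_based_partition(triples, extension_of)
```

Because every subject belongs to exactly one extension, every triple lands in exactly one group. In contrast, a VP table for `WorksAt` would hold all three `WorksAt` triples together, including the one about `a4`, whose structure differs from that of the supervised authors.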
  8. Query processing in our scenario essentially means finding a homomorphism from the query graph $q = (V_{var} \uplus V_{con}, L, P)$ to elements of the data graph $G = (V, L, E)$. According to the following proposition, the structure index can be exploited to perform this task:

     PROPOSITION 1. Let $G$ be a data graph with associated index graph $G^\sim$ and let $q$ be another graph such that there is a homomorphism $h$ from $q$ into $G$. Then $h^\sim$ with $h^\sim(v) = [h(v)]^\sim$ is a homomorphism from $q$ into $G^\sim$.

     Intuitively speaking, a mapping into $G$ exists only if one also exists into the associated index graph $G^\sim$. Further, the resulting extensions $B_i = [h(v)]^\sim$ from $V^\sim$ in $G^\sim$ that match the nodes in $q$ will contain the data graph matches $h(v)$. Thus, the procedure for query processing can be decomposed into two steps: (1) finding matching extensions $B_i$ on the index graph $G^\sim$ first, and (2) then combining data elements retrieved for $B_i$ to obtain the final data graph matches. In the following, we denote by $G^\sim_{Idx}$ the index used for the retrieval of elements from the index graph, while $G_{Idx}$ is used for accessing elements of the data graph.
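Proposition 1 can be checked on a toy example. The sketch below assumes that the index graph has an edge between two extensions whenever some pair of their members is connected by a data edge with the same label (Definition 1 is not reproduced on these slides, so this is an assumption); the graph, the block names and the helper function are hypothetical.

```python
def is_homomorphism(query_edges, mapping, graph_edges):
    """True if every query edge is mapped onto an existing edge of the graph."""
    return all((mapping[x], label, mapping[y]) in graph_edges
               for x, label, y in query_edges)


# Hypothetical data graph and an assumed assignment of vertices to extensions.
data_edges = {
    ("s1", "Supervises", "a1"),
    ("s1", "Supervises", "a2"),
    ("a1", "WorksAt", "u1"),
    ("a2", "WorksAt", "u2"),
}
block = {"s1": "B1", "a1": "B2", "a2": "B2", "u1": "B3", "u2": "B3"}

# Index graph edges are the images of the data edges under the block mapping.
index_edges = {(block[x], label, block[y]) for x, label, y in data_edges}

query_edges = [("y", "Supervises", "x"), ("x", "WorksAt", "u")]
h = {"y": "s1", "x": "a1", "u": "u1"}           # a match on the data graph
h_index = {v: block[d] for v, d in h.items()}   # the lifted mapping h~
```

Under this edge definition, the lifted mapping is a homomorphism for any match on the data graph. The contrapositive is what makes the first processing step useful: if no index graph match exists, there cannot be a data graph match either.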
  9. Just like an answer, an index graph match is the result of a homomorphic mapping $h^\sim$ from the query graph $q(V_q, L_q, P)$ to the index graph $G^\sim(V^\sim, L, E^\sim)$. Elements of an index graph match are vertices of $G^\sim$ that are assigned to variables and constants of $q$. For this computation, we propose a join procedure that returns a result table $R^\sim$ containing all matches $h^\sim : V_q \to V^\sim$. First, a set of index graph candidate edges $E^\sim_l$ is retrieved from $G^\sim$ for every edge label $l$ occurring in the query (using $G^\sim_{Idx}$). Then, these candidate sets are joined along the vertices of $q$ to obtain $R^\sim$. Figure 3 illustrates two matches.
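The join procedure on the index graph can be sketched as follows. This is a simplified nested-loop version under stated assumptions: all query vertices are treated as variables, the index graph is given as a plain set of labelled edges between hypothetical block names, and no join order optimization is attempted.

```python
def extend(row, var, value):
    """Bind var to value in a copy of row, or return None if that conflicts."""
    if row.get(var, value) != value:
        return None
    new_row = dict(row)
    new_row[var] = value
    return new_row


def match_index_graph(query_edges, index_edges):
    """Return the result table: all mappings of query vertices to index vertices."""
    rows = [{}]
    for x, label, y in query_edges:
        # Candidate index edges for this query edge label.
        candidates = [(bx, by) for bx, l, by in index_edges if l == label]
        next_rows = []
        for row in rows:
            for bx, by in candidates:
                new_row = extend(row, x, bx)
                if new_row is not None:
                    new_row = extend(new_row, y, by)
                if new_row is not None:
                    next_rows.append(new_row)
        rows = next_rows
    return rows


# Hypothetical index graph with three extensions.
index_edges = {("B1", "Supervises", "B2"), ("B2", "WorksAt", "B3")}

matches = match_index_graph(
    [("y", "Supervises", "x"), ("x", "WorksAt", "u")], index_edges
)
no_matches = match_index_graph(
    [("x", "WorksAt", "y"), ("y", "Supervises", "z")], index_edges
)
```

The second query asks for something that works at a supervisor, which no extension satisfies, so the result table is empty and data access can be skipped entirely.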
  10. The previous computation results in a set $R^\sim$ of index graph matches $h^\sim : V_q \to V^\sim$. Every element of these matches is an extension, which is essentially a set of vertices of the queried data graph $G$. According to Proposition 1, every match of the query against the data graph is “contained” in one of the index graph matches calculated so far (e.g. $h(v_1)$ is in $h^\sim(v_1) = [h(v_1)]^\sim$). It suffices to focus on the index graph matches for the computation of the data graph matches, because only the data contained in them satisfies the overall query structure. Moreover, tree-shaped query parts containing only undistinguished variables can even be pruned away entirely. This notion of a tree-shaped query part is defined inductively. Given such a tree-shaped query part, a stronger property can be asserted for the index graph matches: they contain all and only data graph matches, so that no further processing at the level of the data graph is needed. In words: if the query is of the aforementioned tree shape, then every data node from any extension associated with the query root r by an index graph match is a data graph match for r. Hence, before computing data graph matches, the respective query parts can be removed.
  11. Fig. 2b depicts a query that asks for authors x working at the same place as their supervisors y, namely a place called KIT. One match on the index graph is $h^\sim_1 = \{u \mapsto B_3, x \mapsto B_4, y \mapsto B_1, z \mapsto B_2, KIT \mapsto B_5\}$. Based on this, we know that data elements belonging to extensions obtained from the index graph match satisfy the query structure, e.g. elements in $B_4$ are authors of some z, are supervised by some y, and work at some place u that has a name. A tree-like part that can be pruned is AuthorOf(x, z). It produces the index graph match $\langle B_4\ AuthorOf\ B_2 \rangle$. Since 2, 4 and 6 in $B_4$ are already known to be authors of some z, no further data processing is needed for this query part. However, we have to look at the data to verify that elements in $B_4$ work at KIT and are supervised by some y also working at KIT. For this, we need to retrieve and join the triple matches for $\langle x\ WorksAt\ u \rangle$, $\langle y\ WorksAt\ u \rangle$, $\langle u\ Name\ KIT \rangle$ and $\langle y\ Supervises\ x \rangle$. Note that the query example here contains cycles. In practice, there are many queries exhibiting a simpler structure, which offer greater potential for query pruning. In the extreme case where no index graph matches can be found, we can skip the entire second step and avoid data access and joins completely.
  12. After pruning the query, we use another join procedure to compute a result table whose rows capture bindings to distinguished query variables. These bindings are data elements contained in the index graph matches $h^\sim$ that satisfy the structure as well as the concrete elements (i.e. constants and distinguished variables) mentioned in the query. Query edges are processed successively. At every iteration, triples are retrieved from $G_{Idx}$ and joined with the (intermediate) result set. More precisely, given the query edge p(x, y), the triples $\langle x \mapsto s, p, y \mapsto o \rangle$ matching p(x, y) are considered. They are fetched from the corresponding block $[s]^\sim$ of the structure-based partitioned data graph index $G_{Idx}$, where $s \in [s]^\sim$ and $h^\sim : x \mapsto [s]^\sim$. Intuitively speaking, only triples with subjects that are contained in the index match $[s]^\sim$ are retrieved from disk. Thus, only subjects that are known to satisfy the query structure are considered. This is different from the standard approaches [1, 8], where all triples matching the query edge are taken into account, which might contain subjects not in $[s]^\sim$.

     The procedure presented above computes data matches for one index graph match $h^\sim$. In order to compute all data matches, it has to be repeated for all index graph matches $h^\sim$ in $R^\sim$. However, the diverse matches might partially overlap. To formalize and computationally exploit this, we introduce the notion of match classes. In words, the corresponding proposition ensures that all data graph matches of a query can be obtained by a successive refinement of match classes and their associated data matches. Consequently, the optimized procedure for computing query data matches consists of two main parts: (1) update of match classes and (2) evaluation of match classes.

     Match classes are defined w.r.t. query vertices. For the first part, match classes are thus created (updated) according to the query vertices that are added during join processing. At first, there is only one initial match class $R^\sim$ consisting of all index graph matches (line 1). During the processing of query atoms p(x, y), the set of classes MC becomes more and more “fine-grained”, as any matches not coinciding on how x and y are mapped to $V^\sim$ will be distributed to different match classes (line 11). A hash map H, which associates pairs of index matches (x-y-instantiations) with match classes, is employed to check for overlaps. Note that during the processing of the atoms in P, the number of classes may grow as high as the number of matches, in which case every match constitutes its own class.
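The data-level join for a single index graph match can be sketched as below. This is an illustration under stated assumptions rather than the procedure from the slides: the structure-based partitioned index is modelled as a dictionary keyed by (extension, property), constants are bound to themselves, the match-class optimization is left out, and all names and data are hypothetical.

```python
def evaluate_match(query_edges, index_match, sp_groups, constants=frozenset()):
    """Join triples fetched only from the blocks named by the index graph match."""
    rows = [{c: c for c in constants}]  # constants are bound to themselves
    for x, p, y in query_edges:
        # Fetch only triples whose subject lies in the extension matched by x.
        fetched = sp_groups.get((index_match[x], p), [])
        next_rows = []
        for row in rows:
            for s, _, o in fetched:
                consistent = row.get(x, s) == s and row.get(y, o) == o
                if consistent and (x != y or s == o):
                    next_rows.append({**row, x: s, y: o})
        rows = next_rows
    return rows


# Hypothetical structure-based partitioned data, keyed by (extension, property).
sp_groups = {
    ("B1", "Supervises"): [("s1", "Supervises", "a1"), ("s1", "Supervises", "a2")],
    ("B1", "WorksAt"): [("s1", "WorksAt", "u1")],
    ("B2", "WorksAt"): [("a1", "WorksAt", "u1"), ("a2", "WorksAt", "u2")],
    ("B3", "Name"): [("u1", "Name", "KIT"), ("u2", "Name", "MIT")],
}
index_match = {"y": "B1", "x": "B2", "u": "B3", "KIT": "B4"}
query_edges = [
    ("x", "WorksAt", "u"),
    ("y", "WorksAt", "u"),
    ("u", "Name", "KIT"),
    ("y", "Supervises", "x"),
]
answers = evaluate_match(query_edges, index_match, sp_groups, constants={"KIT"})
```

Only the four groups named by the index match are touched. Triples about subjects in other extensions are never read, which is the source of the I/O reduction the text describes.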
  13. More optimized systems have been built that implement the concepts of indexing and query optimization [8, 10]. Since these aspects are orthogonal, we use the work in [1] as the baseline, which is the state-of-the-art in, and is purely focused on, partitioning and query processing. We compare our work, called structure-based query processing, against this baseline.

     We now summarize the experiment reported in detail in [2]. It is based on DBLP and several synthetic datasets containing several million triples, created using the Lehigh University Benchmark. A set of 30 queries, categorized into five classes ranging from single-atom queries to complex graph-shaped queries, has been used. We use two parameterizations for the experiments: (1) SPB is based on G0B, calculated using backward bisimulation only, and (2) SPFB uses G0FB, a restricted backward and forward bisimulation adapted to the workload by setting L1, L2 to include only labels occurring in prunable query parts. G0FB is much smaller (4%-30%), and the indexes for G0B make up only a small percentage (0.08%-2%) of the data graph.
  14. We have proposed techniques for RDF data partitioning and query processing that exploit the underlying structure to improve the management of RDF data, based on a novel structure index called PIG. In a principled manner, we showed that this approach is faster than the state-of-the-art, especially for complex structured queries. As future work, we will elaborate on how existing work on RDF query optimization can be used for the proposed structure-based query processing technique. Further, strategies proposed for optimizing updates of XML structure indexes will be studied and adopted.
  17. Scalability: We measured the average query performance for LUBM with varying size (i.e. generated for 1, 5, 10, 20 and 50 universities). We found that the performance of our approach improves with the size of the data. In particular, the gain for load and join increases in larger proportion than the overhead incurred for the index match. This is because match performance is determined by the size of the index graph, which depends on the structure but not on the size of the data graph. Thus, the match time does not necessarily increase when the data graph becomes larger. The positive effect of data filtering (I/O reduction) and query pruning (load and join), however, correlates with the data size.