Query-Load aware partitioning of RDF data
Query-Load aware partitioning of RDF datasets using standard fragmentation techniques for relational databases aimed to provide an insight of the advantages of a proper fragmentation scheme in big semantic databases for efficient query processing.

Usage Rights

© All Rights Reserved

    Query-Load aware partitioning of RDF data: Presentation Transcript

    • Query-Load aware partitioning of RDF data
      Luis Galárraga
      Saarbrücken, July 4th 2011
      July 4th, 2011 Query load aware partitioning of RDF data 1/37
    • Outline
      ● Motivation & background
      ● Fragmentation in databases
      ● Observations & goals
      ● Proposed methodology
      ● Preliminary results
    • Motivation
      ● Increasing interest in semantic representations for knowledge.
        – Increasing number of data providers (e.g. the Linked Data initiative)
        – Semantic Web: “Web of knowledge”
        – Growing data sources (e.g. Wikipedia)
      ● Need for efficient query processing
        – Centralized solutions might become infeasible as data steadily grows.
        – Taking advantage of parallelism can help improve performance.
    • Data keeps growing
      [Figure: DBpedia dataset size growth; x-axis: date (10/10/06 to 04/01/12), y-axis: size in GB (0 to 40)]
      ● DBpedia 3.6: 3,500,000 resources, 0.5 billion facts (http://dbpedia.org)
      ● Semantic Web Challenge 2011: 2 billion triples, 20 GB dataset (http://challenge.semanticweb.org)
    • RDF and triple stores
      ● The Resource Description Framework (RDF) is a language to represent knowledge about resources (things).
        – Resources are identified by URIs, e.g. <http://www.mpii.de/yago/resource/John_Doe>
      ● It uses statements, or triples, made of a subject, a predicate and an object:
          PREFIX yago: <http://www.mpii.de/yago/resource/>
          PREFIX foaf: <http://xmlns.com/foaf/0.1/>
          yago:John_Doe   foaf:name     "John Doe"
          (Subject)       (Predicate)   (Object)
    • RDF and triple stores
      ● Data in a triple store can be seen as a data graph or as a huge 3-column relation.
          Subject               Predicate    Object
          yago:John_Doe         foaf:name    "John Doe"
          yago:John_Doe         foaf:knows   yago:Max_Mustermann
          yago:Max_Mustermann   foaf:name    "Max Mustermann"
      ● Existing solutions like Jena or Sesame use some variation of the 3-column relation.
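The two views of the same data described in the slide above can be sketched in a few lines of Python. This is only an illustration: the tuple layout and the helper name `as_graph` are assumptions made here, not the storage format of Jena or Sesame.

```python
from collections import defaultdict

# The 3-column relation view: one (subject, predicate, object) tuple per triple.
# Sample rows mirror the slide's example.
TRIPLES = [
    ("yago:John_Doe", "foaf:name", '"John Doe"'),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
    ("yago:Max_Mustermann", "foaf:name", '"Max Mustermann"'),
]


def as_graph(triples):
    """Data graph view: map each subject to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = as_graph(TRIPLES)
for subject, edges in graph.items():
    print(subject, edges)
```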
    • How to query RDF?
      ● Use of the data graph abstraction.
        – SQL was designed for relational databases.
      ● SPARQL defines queries as subgraph patterns to be matched within the data graph.
          PREFIX foaf: <http://xmlns.com/foaf/0.1/>
          PREFIX yago: <http://www.mpii.de/yago/resource/>
          SELECT ?name
          WHERE {
            ?person a foaf:Person .
            ?person foaf:knows yago:Max_Mustermann .
            ?person foaf:name ?name .
          }
      [Figure: data graph in which yago:John_Doe (a foaf:Person, foaf:name "John Doe") foaf:knows yago:Max_Mustermann (foaf:name "Max Mustermann")]
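To make the idea of matching a subgraph pattern concrete, here is a naive Python sketch that evaluates the three triple patterns of the query above against a small in-memory triple list. It is not how a SPARQL engine works internally; the data, the helper names (`match_pattern`, `evaluate`) and the nested-loop strategy are assumptions chosen for brevity.

```python
TRIPLES = [
    ("yago:John_Doe", "a", "foaf:Person"),
    ("yago:John_Doe", "foaf:name", '"John Doe"'),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
    ("yago:Max_Mustermann", "foaf:name", '"Max Mustermann"'),
]


def is_var(term):
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None if impossible."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def evaluate(patterns, triples):
    """Join the patterns one by one, keeping only consistent variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


QUERY = [
    ("?person", "a", "foaf:Person"),
    ("?person", "foaf:knows", "yago:Max_Mustermann"),
    ("?person", "foaf:name", "?name"),
]
results = evaluate(QUERY, TRIPLES)
names = [b["?name"] for b in results]
print(names)
```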
    • Fragmentation in databases
      ● Why? To exploit the processing power of multiple nodes by decomposing operations into parallel sub-operations.
      ● In relational databases:
        – Horizontal fragmentation [Dimovski, 2010]
        – Vertical fragmentation [Hoffer, 1975]
        – Workload driven [Curino, 2010]
      ● It has to be combined with an allocation strategy (assignment of fragments to hosts).
    • Horizontal & vertical fragmentation
      Original relation:
          Subject               Predicate    Object
          yago:John_Doe         foaf:name    "John Doe"
          yago:Max_Mustermann   foaf:name    "Max Mustermann"
          yago:John_Doe         foaf:knows   yago:Max_Mustermann
          yago:John_Doe         foaf:mbox    "jdoe@wherever.com"
          yago:Juan_Perez       foaf:mbox    "jprz@wherever.com"
      ● Horizontal or tuple based fragmentation
          Fragment 1:
            Subject               Predicate    Object
            yago:John_Doe         foaf:name    "John Doe"
            yago:Max_Mustermann   foaf:name    "Max Mustermann"
          Fragment 2:
            Subject               Predicate    Object
            yago:John_Doe         foaf:knows   yago:Max_Mustermann
          Fragment 3:
            Subject               Predicate    Object
            yago:John_Doe         foaf:mbox    "jdoe@wherever.com"
            yago:Juan_Perez       foaf:mbox    "jprz@wherever.com"
      ● Vertical or column based fragmentation
          Fragment 1:
            Subject               Object
            yago:John_Doe         "John Doe"
            yago:Max_Mustermann   "Max Mustermann"
            yago:John_Doe         yago:Max_Mustermann
            yago:John_Doe         "jdoe@wherever.com"
            yago:Juan_Perez       "jprz@wherever.com"
          Fragment 2:
            Subject               Predicate
            yago:John_Doe         foaf:name
            yago:Max_Mustermann   foaf:name
            yago:John_Doe         foaf:knows
            yago:John_Doe         foaf:mbox
            yago:Juan_Perez       foaf:mbox
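The difference between the two fragmentation styles in the slide above can be sketched as two small Python functions: horizontal fragmentation groups whole tuples (here by predicate value, which is one possible criterion), while vertical fragmentation projects columns. The function names are assumptions made for this illustration.

```python
from collections import defaultdict

TRIPLES = [
    ("yago:John_Doe", "foaf:name", '"John Doe"'),
    ("yago:Max_Mustermann", "foaf:name", '"Max Mustermann"'),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
    ("yago:John_Doe", "foaf:mbox", '"jdoe@wherever.com"'),
    ("yago:Juan_Perez", "foaf:mbox", '"jprz@wherever.com"'),
]


def horizontal_by_predicate(triples):
    """Tuple based: every fragment holds complete rows sharing a predicate."""
    fragments = defaultdict(list)
    for triple in triples:
        fragments[triple[1]].append(triple)
    return dict(fragments)


def vertical(triples, columns):
    """Column based: every fragment keeps only the selected column indices."""
    return [tuple(triple[c] for c in columns) for triple in triples]


horizontal = horizontal_by_predicate(TRIPLES)
subject_object = vertical(TRIPLES, (0, 2))
subject_predicate = vertical(TRIPLES, (0, 1))
print(sorted(horizontal))
print(subject_object[0], subject_predicate[0])
```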
    • Workload-driven fragmentation
      ● Model relationships between tuples as a graph.
        – One node per tuple. Two nodes share an edge if they are required by the same transaction.
      ● Partition the graph.
      ● Try to keep transactions as local as possible.
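A rough Python sketch of the graph model described above: it builds the tuple co-access graph from a transaction log and scores a candidate partition by the weight of the edges it cuts. The transaction log, tuple ids and function names are hypothetical, and the partitioning step itself is left out; a real system would hand the graph to a graph partitioner.

```python
from collections import Counter
from itertools import combinations


def co_access_graph(transactions):
    """Edge weight = number of transactions that touch both tuples."""
    edges = Counter()
    for tuple_ids in transactions:
        for a, b in combinations(sorted(set(tuple_ids)), 2):
            edges[(a, b)] += 1
    return edges


def cut_weight(edges, assignment):
    """Weight of edges whose endpoints live in different partitions.

    A lower value means more transactions stay local.
    """
    return sum(w for (a, b), w in edges.items() if assignment[a] != assignment[b])


# Hypothetical workload: each inner list holds the tuple ids one transaction needs.
transactions = [[1, 2], [1, 2], [3, 4], [2, 3]]
edges = co_access_graph(transactions)

good = {1: 0, 2: 0, 3: 1, 4: 1}   # keeps the frequently co-accessed tuples together
bad = {1: 0, 3: 0, 2: 1, 4: 1}    # splits them
print(cut_weight(edges, good), cut_weight(edges, bad))
```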
    • Observations
      ● RDF query load
        – Updates and insertions are rare
        – Join oriented
      ● Data graph
        – Subjects are more selective than objects, which are more selective than predicates.
        – Constants are unstable for fragmentation.
      ● Distributed query processing
        – Communication costs dominate distributed transactions
    • Goals
      ● Fragment the RDF dataset based on a workload to guarantee:
        – Low latency: limit communication costs by maximizing local transactions while keeping parallelism
        – High throughput
        – Scalability
        – Load balancing: allocate fragments such that hosts get approximately the same load
    • Proposed methodology
      ● Partitioning phase: determine a complete and non-redundant fragmentation of the triple store using a minimal set of predicates extracted from the query load.
      ● Allocation phase: assign fragments to hosts to guarantee load balancing.
    • Normalizing the query load
      ● Extract independent sub-queries.
        – We still want independent sub-queries to run in parallel.
      ● Normalize triple patterns:
        – Turn infrequent URIs or literals into variables.
        – Capture patterns of access.
        – Not applicable to data types with a reduced value space (e.g. xsd:boolean = {true, false})
    • Normalizing the query load
      Two queries that differ only in an infrequent literal:
          PREFIX foaf: <http://xmlns.com/foaf/0.1/>
          SELECT ?name
          WHERE {
            ?x foaf:name ?name .
            ?x foaf:mbox "alice@wherever.com"     (infrequent literal)
          }
          PREFIX foaf: <http://xmlns.com/foaf/0.1/>
          SELECT ?name
          WHERE {
            ?x foaf:name ?name .
            ?x foaf:mbox "bob@wherever.com"       (infrequent literal)
          }
      Both normalize to:
          PREFIX foaf: <http://xmlns.com/foaf/0.1/>
          SELECT ?name
          WHERE {
            ?x foaf:name ?name .
            ?x foaf:mbox ?mbox
          }
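The normalization step shown above can be approximated with the following Python sketch. It replaces subject and object constants that occur fewer than `min_freq` times in the workload with fresh variables. The threshold, the `?c0`-style variable names and the restriction to subject and object positions are assumptions of this sketch; the exception for reduced value spaces such as xsd:boolean is not handled.

```python
from collections import Counter


def is_var(term):
    return term.startswith("?")


def normalize(queries, min_freq):
    """Turn constants seen fewer than `min_freq` times into fresh variables."""
    counts = Counter(
        term
        for query in queries
        for pattern in query
        for term in (pattern[0], pattern[2])   # subject and object only (assumption)
        if not is_var(term)
    )
    normalized = []
    for query in queries:
        fresh = 0
        new_query = []
        for s, p, o in query:
            terms = []
            for term in (s, o):
                if not is_var(term) and counts[term] < min_freq:
                    term = f"?c{fresh}"          # hypothetical fresh-variable naming
                    fresh += 1
                terms.append(term)
            new_query.append((terms[0], p, terms[1]))
        normalized.append(tuple(new_query))
    return normalized


q_alice = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", '"alice@wherever.com"')]
q_bob = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", '"bob@wherever.com"')]

normalized = normalize([q_alice, q_bob], min_freq=2)
pattern_counts = Counter(normalized)
print(pattern_counts)
```

After normalization the two queries collapse into one access pattern with frequency 2, which is the information the later steps rely on.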
    • Extracting predicates
          PREFIX foaf: <http://xmlns.com/foaf/0.1/>
          PREFIX yago: <http://www.mpii.de/yago/..>
          SELECT ?name
          WHERE {
            A  ?x foaf:name ?name .
            B  ?x foaf:mbox ?mbox
          }
          SELECT ?name
          WHERE {
            A  ?z foaf:name ?name .
            B  ?z foaf:mbox ?mbox .
            C  ?z foaf:knows yago:John_Doe .
          }
      Extracted predicates:
          P1: Predicate = foaf:name, from A
          P2: Predicate = foaf:mbox, from B
          P3: Predicate = foaf:knows, from C
          P4: Object = yago:John_Doe, from C
      ● Remember where the predicates come from.
      ● Store join relationships between patterns among the queries:
          Global Query Graph
            A: ?x foaf:name ?name              Freq: 2
            B: ?x foaf:mbox ?mbox              Freq: 2
            C: ?x foaf:knows yago:John_Doe     Freq: 1
            Edge weights: A-B: 2, A-C: 1, B-C: 1
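A possible Python sketch of this extraction step: it derives one simple predicate per constant in a triple pattern and builds the global query graph with pattern frequencies and join weights. Treating two patterns as joined when they share a variable, and identifying patterns across queries by blanking out variable names, are assumptions of this sketch rather than details confirmed by the slides.

```python
from collections import Counter
from itertools import combinations

COLUMNS = ("Subject", "Predicate", "Object")


def is_var(term):
    return term.startswith("?")


def canonical(pattern):
    """Blank out variable names so ?x and ?z versions of a pattern coincide."""
    return tuple("?" if is_var(t) else t for t in pattern)


def simple_predicates(pattern):
    """One 'Column = constant' predicate per constant in the pattern."""
    return [(col, term) for col, term in zip(COLUMNS, pattern) if not is_var(term)]


def global_query_graph(queries):
    """Return (pattern frequencies, join edge weights) over the whole query load."""
    freq, edges = Counter(), Counter()
    for query in queries:
        for pattern in query:
            freq[canonical(pattern)] += 1
        for a, b in combinations(query, 2):
            shared = {t for t in a if is_var(t)} & {t for t in b if is_var(t)}
            if shared:
                edges[frozenset((canonical(a), canonical(b)))] += 1
    return freq, edges


A = ("?", "foaf:name", "?")
B = ("?", "foaf:mbox", "?")
C = ("?", "foaf:knows", "yago:John_Doe")

queries = [
    [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")],
    [("?z", "foaf:name", "?name"), ("?z", "foaf:mbox", "?mbox"),
     ("?z", "foaf:knows", "yago:John_Doe")],
]
freq, edges = global_query_graph(queries)
print(freq[A], freq[B], freq[C])
print(simple_predicates(queries[1][2]))
```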
    • Minterms & Fragments
      ● Conjunctive expressions over a set of predicates, e.g.:
          P1: Predicate = foaf:name
          P2: Predicate = foaf:mbox
          Minterm 00 = ~P1 ^ ~P2
          Minterm 01 = ~P1 ^ P2
          Minterm 10 = P1 ^ ~P2
          Minterm 11 = P1 ^ P2
      ● A minterm defines a fragment.
        – The set of triples satisfying the logical function
      ● The set of all possible minterms determines a non-redundant and complete fragmentation.
        – But we want a minimal set of predicates.
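The link between minterms and fragments can be illustrated with a short Python sketch: each triple is assigned to the minterm given by the truth vector of the predicates, here ordered (P1, P2). The sample triples and the function names are assumptions made for illustration. Because every triple lands in exactly one group, the result is complete and non-redundant by construction.

```python
COLUMNS = {"Subject": 0, "Predicate": 1, "Object": 2}

TRIPLES = [
    ("yago:John_Doe", "foaf:name", '"John Doe"'),
    ("yago:Max_Mustermann", "foaf:name", '"Max Mustermann"'),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
    ("yago:John_Doe", "foaf:mbox", '"jdoe@wherever.com"'),
    ("yago:Juan_Perez", "foaf:mbox", '"jprz@wherever.com"'),
]


def holds(predicate, triple):
    """Evaluate a simple predicate of the form (column, constant) on a triple."""
    column, constant = predicate
    return triple[COLUMNS[column]] == constant


def fragment_by_minterms(triples, predicates):
    """Group triples by truth vector; only satisfied minterms get a fragment."""
    fragments = {}
    for triple in triples:
        minterm = tuple(holds(p, triple) for p in predicates)
        fragments.setdefault(minterm, []).append(triple)
    return fragments


P1 = ("Predicate", "foaf:name")
P2 = ("Predicate", "foaf:mbox")
fragments = fragment_by_minterms(TRIPLES, [P1, P2])
for minterm, rows in sorted(fragments.items()):
    print(minterm, len(rows))
```

The minterm (True, True) never appears in the output because no triple can have two different predicates at once.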
    • Optimal Horizontal Fragmentation
      ● A predicate is redundant if the fragmentation is insensitive to its presence or absence.
      ● Start with an empty set.
      ● For every extracted predicate:
        – Add it to the set and fragment the database by building the minterms.
        – If the predicate is redundant, ignore it.
        – If it is not redundant, check whether it made any previously added predicate redundant.
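The loop above could look roughly like this in Python. The sketch decides redundancy by comparing the partitions of a sample dataset with and without a predicate, and it drops earlier predicates that a new one makes redundant; both choices, and all function names, are assumptions of this illustration rather than the presented implementation.

```python
COLUMNS = {"Subject": 0, "Predicate": 1, "Object": 2}

TRIPLES = [
    ("yago:John_Doe", "foaf:name", '"John Doe"'),
    ("yago:Max_Mustermann", "foaf:name", '"Max Mustermann"'),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
    ("yago:John_Doe", "foaf:mbox", '"jdoe@wherever.com"'),
    ("yago:Juan_Perez", "foaf:mbox", '"jprz@wherever.com"'),
]


def holds(predicate, triple):
    column, constant = predicate
    return triple[COLUMNS[column]] == constant


def partition(triples, predicates):
    """Fragmentation induced by the minterms, as a set of triple sets."""
    groups = {}
    for triple in triples:
        key = tuple(holds(p, triple) for p in predicates)
        groups.setdefault(key, set()).add(triple)
    return {frozenset(group) for group in groups.values()}


def minimal_predicates(triples, candidates):
    chosen = []
    for predicate in candidates:
        if partition(triples, chosen + [predicate]) == partition(triples, chosen):
            continue  # redundant: the fragmentation does not change
        chosen.append(predicate)
        # Drop earlier predicates that the new one has made redundant.
        for earlier in chosen[:-1]:
            rest = [p for p in chosen if p != earlier]
            if partition(triples, rest) == partition(triples, chosen):
                chosen = rest
    return chosen


P1 = ("Predicate", "foaf:name")
P2 = ("Predicate", "foaf:mbox")
P3 = ("Predicate", "foaf:knows")
selected = minimal_predicates(TRIPLES, [P1, P2, P3])
print(selected)
```

On this sample P3 is skipped: once P1 and P2 are in the set, the foaf:knows triples already form their own fragment, so adding P3 changes nothing.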
    • Optimal Horizontal Partitioning
          P1: Predicate = foaf:name
          P2: Predicate = foaf:mbox
          Minterm 00: Predicate != foaf:mbox AND Predicate != foaf:name
          Minterm 01: Predicate != foaf:mbox AND Predicate = foaf:name
          Minterm 10: Predicate = foaf:mbox AND Predicate != foaf:name
          Minterm 11: Predicate = foaf:mbox AND Predicate = foaf:name
      ● The algorithm is O(n^2) in the number of predicates.
      ● Even though the number of minterms is exponential, many of them will be unsatisfiable.
    • Optimal Horizontal Partitioning
          P1: Predicate = foaf:name
          P2: Predicate = foaf:mbox
          Minterm 00: Predicate != foaf:mbox AND Predicate != foaf:name
          Minterm 01: Predicate = foaf:name
          Minterm 10: Predicate = foaf:mbox
      (Simplified minterms. Minterm 11 is dropped because a triple has exactly one predicate, so it is unsatisfiable.)
      ● The algorithm is O(n^2) in the number of predicates.
      ● Even though the number of minterms is exponential, many of them will be unsatisfiable.
    • Allocating the fragments
      ● Fragments have access frequencies derived from their provenance and might join in the query load.
          Minterm 01: Predicate = foaf:name, from A
          Minterm 10: Predicate = foaf:mbox, from B
          Global query graph:
            A: ?x foaf:name ?name              Freq: 2
            B: ?x foaf:mbox ?mbox              Freq: 2
            C: ?x foaf:knows yago:John_Doe     Freq: 1
            Edge weights: A-B: 2, A-C: 1, B-C: 1
      ● Allocate fragments to hosts so that:
        – They are on the same host if they can join in the query load.
        – Hosts receive approximately the same load.
    • Allocating the fragments
      ● Sort fragments in descending order of load.
      ● For every fragment, calculate the benefit of assigning it to every host:
          $T_L = \sum_g F_g \times S_g$
          $U_H = \frac{T_L}{n}$
          $\mathrm{benefit}(f, H) = \frac{U_H}{U_H \,\circ\, CL_H} \times \sum_{g \in H} \left[ E_{f,g} + 1 \right]$
        (the operator $\circ$ between $U_H$ and $CL_H$ did not survive in the transcript)
          $S_g$ = size of fragment g; $F_g$ = frequency of access of fragment g
          $n$ = number of hosts
          $\mathrm{benefit}(f, H)$ = benefit of assigning fragment f to host H
          $CL_H$ = current load for host H (from fragments assigned so far)
          $E_{f,g}$ = weight between fragments f and g in the global query load graph
      ● Assign it to the most beneficial host.
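The greedy loop described above can be sketched in Python as follows. The loop structure (sort by load, score every host, pick the best) follows the slide; the scoring function `example_benefit` does not. Since the slide's benefit formula is garbled in the transcript, `example_benefit` is an illustrative stand-in that multiplies the remaining share of the uniform load by one plus the join weights towards fragments already on the host. Fragment names, frequencies and sizes are hypothetical.

```python
def allocate(fragments, edges, n_hosts, benefit):
    """Greedy allocation.

    fragments: {name: (access frequency F_g, size S_g)}
    edges:     {frozenset({f, g}): join weight E_fg}
    """
    load = {f: freq * size for f, (freq, size) in fragments.items()}
    uniform = sum(load.values()) / n_hosts           # U_H = T_L / n
    hosts = [[] for _ in range(n_hosts)]
    current = [0.0] * n_hosts                        # CL_H per host
    for f in sorted(load, key=load.get, reverse=True):
        scores = [benefit(f, hosts[h], current[h], uniform, edges)
                  for h in range(n_hosts)]
        best = max(range(n_hosts), key=scores.__getitem__)
        hosts[best].append(f)
        current[best] += load[f]
    return hosts


def example_benefit(f, host_fragments, current_load, uniform_load, edges):
    """ASSUMPTION: stand-in scoring function, not the formula from the slide."""
    affinity = 1 + sum(edges.get(frozenset((f, g)), 0) for g in host_fragments)
    return (uniform_load - current_load) / uniform_load * affinity


# Hypothetical frequencies and sizes for the three fragments of the running example.
fragments = {"01": (2, 2), "10": (2, 2), "00": (1, 10)}
edges = {
    frozenset(("01", "10")): 2,
    frozenset(("01", "00")): 1,
    frozenset(("10", "00")): 1,
}
hosts = allocate(fragments, edges, n_hosts=2, benefit=example_benefit)
print(hosts)
```

With these numbers the large fragment 00 fills one host, and the two fragments that join most often (01 and 10) end up together on the other.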
    • Evaluating query complexity
      ● Local query graph
          PREFIX foaf: <http://xmlns.com/foaf/0.1/>
          PREFIX yago: <http://www.mpii.de/yago/..>
          SELECT ?name
          WHERE {
            ?z foaf:name ?name .
            ?z foaf:mbox ?mbox .
            ?z foaf:knows yago:John_Doe .
          }
      [Figure: local query graph with one node per triple pattern: ?z foaf:name ?name, ?z foaf:mbox ?mbox, ?z foaf:knows yago:John_Doe]
    • Evaluating the fragmentation
      ● The distributed query graph for a query Q is obtained from the global query graph + the fragment definitions + the query graph of Q.
          PREFIX foaf: <http://xmlns.com/foaf/0.1/>
          PREFIX yago: <http://www.mpii.de/yago/..>
          SELECT ?name
          WHERE {
            A  ?z foaf:name ?name .
            B  ?z foaf:mbox ?mbox .
            C  ?z foaf:knows yago:John_Doe .
          }
          Host 1:
            Fragment 10: Predicate = foaf:mbox (relevant to B)
            Fragment 01: Predicate = foaf:name (relevant to A)
          Host 2:
            Fragment 00: Predicate != foaf:mbox AND Predicate != foaf:name (relevant to C)
      [Figure: distributed query graph; legend: remote edge, local edge]
    • Preliminary results
      ● Metrics to evaluate query complexity:
        – Number of edges in the local query graph
        – Number of remote edges in the distributed query graph
      ● Metrics to evaluate fragmentation quality for a query:
        – Number of local edges in the distributed query graph
        – Number of hosts required to answer the query
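As an illustration of these metrics, the following Python sketch counts local edges, remote edges and contacted hosts for one query. It assumes, for simplicity, that every triple pattern is answered by exactly one host; in general a pattern may be relevant to several fragments. The function name and the dictionary layout are assumptions of this sketch.

```python
def evaluate_query(query_edges, pattern_to_host):
    """Compute the per-query metrics from join edges and a pattern-to-host mapping."""
    local = sum(1 for a, b in query_edges
                if pattern_to_host[a] == pattern_to_host[b])
    return {
        "edges": len(query_edges),                     # edges in the local query graph
        "local": local,                                # local edges, distributed graph
        "remote": len(query_edges) - local,            # remote edges, distributed graph
        "hosts": len(set(pattern_to_host.values())),   # hosts needed for the query
    }


# Running example: patterns A and B are served by host 1, pattern C by host 2.
query_edges = [("A", "B"), ("A", "C"), ("B", "C")]
pattern_to_host = {"A": 1, "B": 1, "C": 2}
metrics = evaluate_query(query_edges, pattern_to_host)
print(metrics)
```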
    • Preliminary results
      Number of hosts: 5
          Run  Dataset         Dataset description                         File size (MB)  # triples  # sub-queries  # predicates
          1    Subset Dbpedia  Dbpedia foaf information (names and dates)  136             1745624    10             9
          2    Subset Dbpedia  Dbpedia foaf information (names and dates)  136             1745624    10             10
          3    YAGO Core       YAGO Core dataset                           2662.4          26227687   9              21
          4    YAGO sample     RDF-3x YAGO dump sample                     3276.8          35238246   19             35
      [Figure: Edge count in local query graphs vs number of contacted hosts, per run of the algorithm; series: hosts contacted, average edges in local query graph; x-axis: runs 1 to 4]
      [Figure: Local edges vs remote edges in the Distributed Query Graph, per run of the algorithm; series: average local edges, average remote edges; x-axis: runs 1 to 4]
    • Conclusions
      ● Use of standard techniques from relational databases.
      ● Method independent of the actual storage implementation.
        – Huge 3-column table abstraction
      ● It can be easily extended to support redundancy.
      ● Applicable to evolving query loads
        – By changing the level of constant normalization
    • Future work
      ● Evaluate the quality of the partitioning
        – Using real execution costs: this requires a distributed index, a query planner and a distributed cost model
        – Against other approaches (e.g. fragmentation by predicate)
      ● Evaluate the greedy allocation algorithm
        – Against the optimal solution, round robin, etc.
      ● Use of estimates for fragment sizes
        – So far extracted via queries.
    • Thanks for your attention