A Main Memory Index Structure to Query Linked Data

2,170 views
2,105 views

Published on

That's the slides with which I presented my paper at the Linked Data on the Web workshop (LDOW) at WWW 2011, Hyderabad, India.

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,170
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
26
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

A Main Memory Index Structure to Query Linked Data

  1. 1. A Main Memory Index Structure to Query Linked Data Olaf Hartig http://olafhartig.de/foaf.rdf#olaf Frank Huber Database and Information Systems Research Group Humboldt-Universität zu Berlin
  2. 2. The Issue no reuse given 0 0,2 0,4 0,6 0,8 1 0 5 10 15 20 25 30 0 20 40 60 80 orderContactInfoPhillipe (Query No. 36)UnsetPropsPhillipe (Query No. 37)2ndDegree1Phillipe (Query No. 38)2ndDegree2Phillipe (Query No. 39) IncomingPhillipe (Query No. 40) 0 0,2 0,4 0,6 0,8 1 0 5 10 15 20 25 30 0 20 40 60 80 hit rate number of query results query execution time (in seconds)Olaf Hartig - A Main Memory Index Structure to Query Linked Data 2
  3. 3. The Issue no reuse given 0 order 0,2 0,4 0,6 0,8 1 Descriptor20 25 30 0 5 10 15 objects 0 20 40 60 80 in the query-localContactInfoPhillipe (Query No. 36) dataset after query execution:UnsetPropsPhillipe (Query No. 37) 1722ndDegree1Phillipe 533 (Query No. 38)2ndDegree2Phillipe (Query No. 39) IncomingPhillipe (Query No. 40) 0 0,2 0,4 0,6 0,8 1 0 5 10 15 20 25 30 0 20 40 60 80 hit rate number of query results query execution time (in seconds)Olaf Hartig - A Main Memory Index Structure to Query Linked Data 3
  4. 4. query-local Logical representation of dataset Linked Data from the Web Physical representation of Linked Data from the Web ? What data structure do we use to physically represent the query-local dataset?Olaf Hartig - A Main Memory Index Structure to Query Linked Data 4
  5. 5. Outline 1. Requirements + Existing Work 2. Data Structures 3. EvaluationOlaf Hartig - A Main Memory Index Structure to Query Linked Data 5
  6. 6. Requirements ● (Consecutively) build and use ad hoc collections of many small sets of RDF triples ● Four main operations: ● Find … matching triples for a triple pattern in all descriptor objects ● Add, Remove, Replace … descriptor objects ● Support of concurrent access (i.e. isolation) ● Non-relevant properties: ● Querying descriptor objects individually is not necessary ● No need to write data back to the Web ● ACID properties not required for complete queriesOlaf Hartig - A Main Memory Index Structure to Query Linked Data 6
  7. 7. Requirements ● (Consecutively) build and use ad hoc collections of many small sets of RDF triples ● Four main operations: ● Find … matching triples for a triple pattern in all descriptor objects ● Add, Remove, Replace … descriptor objects ● Support of concurrent access (i.e. isolation) ● Non-relevant properties: ● Querying descriptor objects individually is not necessary ● No need to write data back to the Web ● ACID properties not required for complete queriesOlaf Hartig - A Main Memory Index Structure to Query Linked Data 7
  8. 8. Existing Work ● Disk based storage solutions for RDF data ● Unsuitable due to very costly I/O operations ● Main memory based data structures in the literature ● Focus on a large, single set of RDF triples ● Optimized for complete graph pattern queries or path queries ● Main memory based data structures in RDF frameworks ● Focus on Jena, ARQ and NG4J ● Inefficient (see evaluation)Olaf Hartig - A Main Memory Index Structure to Query Linked Data 8
  9. 9. Outline 1. Requirements + Existing Work 2. Data Structures 3. EvaluationOlaf Hartig - A Main Memory Index Structure to Query Linked Data 9
  10. 10. Hash-Based Index for RDF Data Logical representation Physical representation SP PO SO ● Dictionary: ● Two-way mapping between RDF Dict terms and numerical identifiers S P O ● 6 hash tables: ● Each hash table contains all ID-encoded triples ● Efficient support for all types of triple patterns *Similar to Harth and Decker, 2005Olaf Hartig - A Main Memory Index Structure to Query Linked Data 10
  11. 11. Hash-Based Index for RDF Data Logical representation Physical representation SP PO SO ● Dictionary: ● Two-way mapping between RDF Dict terms and numerical identifiers S P O ● 6 hash tables: ● Each hash table contains t = ( id ,id ,id ) all ID-encoded triples id s p o ● Efficient support for all types of triple patterns *Similar to Harth and Decker, 2005Olaf Hartig - A Main Memory Index Structure to Query Linked Data 11
  12. 12. Hash-Based Index for RDF Data Logical representation ?acq know Find s Physical representation http://bob.name SP PO SO ● Dictionary: ● Two-way mapping between RDF Dict terms and numerical identifiers S P O ● 6 hash tables: ● Each hash table contains t = ( id ,id ,id ) all ID-encoded triples id s p o ● Efficient support for all types of triple patterns *Similar to Harth and Decker, 2005Olaf Hartig - A Main Memory Index Structure to Query Linked Data 12
  13. 13. Individual Indexing query-local dataset Logical representation Physical representation SP PO SO SP PO SO S P O Dict S P O SP PO SO ● Idea: Index each descriptor object S P O separately ● Implementation of the four operations: ● Add, Remove, and Replace are straightforward ● Find requires iterating over all indexesOlaf Hartig - A Main Memory Index Structure to Query Linked Data 13
  14. 14. Individual Indexing ?acq know Find s query-local http://bob.name dataset Logical representation Physical representation SP PO SO SP PO SO S P O Dict S P O SP PO SO ● Idea: Index each descriptor object S P O separately ● Implementation of the four operations: ● Add, Remove, and Replace are straightforward ● Find requires iterating over all indexesOlaf Hartig - A Main Memory Index Structure to Query Linked Data 14
  15. 15. Combined Indexing query-local dataset Logical representation Physical representation SP PO SO ● Idea: Use a single index for all descriptor objects Dict S P O ● src – maps each triple to a set of descriptor object IDsOlaf Hartig - A Main Memory Index Structure to Query Linked Data 15
  16. 16. Combined Indexing query-local dataset Logical representation Physical representation SP PO SO ● Idea: Use a single index for all descriptor objects Dict S P O ● src – maps each triple to a set of descriptor object IDs tid = ( ids,idp,ido )Olaf Hartig - A Main Memory Index Structure to Query Linked Data 16
  17. 17. Combined Indexing query-local dataset Logical representation Physical representation SP PO SO ● Idea: Use a single index for all descriptor objects Dict S P O ● src – maps each triple to a set of descriptor object IDs tid = ( ids,idp,ido ) + src( tid ) = { , }Olaf Hartig - A Main Memory Index Structure to Query Linked Data 17
  18. 18. Quad Indexing query-local dataset Logical representation Physical representation SP PO SO ● Idea: Use a single quad index for all descriptor objects Dict S P O ● quad = ID-encoded triple + descriptor object ID q = ( (ids,idp,ido) , )Olaf Hartig - A Main Memory Index Structure to Query Linked Data 18
  19. 19. Outline 1. Requirements + Existing Work 2. Data Structures 3. EvaluationOlaf Hartig - A Main Memory Index Structure to Query Linked Data 19
  20. 20. Experiment Setup Does this affect the overall execution time for link traversal based query executions ?Olaf Hartig - A Main Memory Index Structure to Query Linked Data 20
  21. 21. Experiment Setup Does this affect the overall execution time for link traversal based query executions ? ● Simulation of the Web of Data ● Linked Data server publishes BSBM dataset (scal. factor: 50) ● Adjusted BSBM queries link to the simulation server ● Experiment: ● Sequence of 200 query mixes ● Reuse of the query-local dataset for the whole sequence ● IndIR, CombIR, and QuadIR (as presented), engine: SQUIN ● NamedGraphSetImpl (NG4J/Jena), engine: SemWeb ClientOlaf Hartig - A Main Memory Index Structure to Query Linked Data 21
  22. 22. Execution Time 2500 80overall number of descr.objects in the queried dataset NG4J (SWClLib 70 ) 2000 IndIR, m=4 60 CombIR, m=12 CombQuadIR, execution time in seconds m=12 50 1500 40 1000 30 20 500 10 0 0 0 40 80 120 160 200 0 20 40 60 80 100 120 140 160 180 200 query mix query mix Olaf Hartig - A Main Memory Index Structure to Query Linked Data 22
  23. 23. Execution Time 2500 80overall number of descr.objects in the queried dataset NG4J (SWClLib 70 ) 2000 IndIR, m=4 60 CombIR, m=12 CombQuadIR, execution time in seconds m=12 50 1500 40 1000 30 20 500 10 0 0 0 40 80 120 160 200 0 20 40 60 80 100 120 140 160 180 200 query mix query mix Olaf Hartig - A Main Memory Index Structure to Query Linked Data 23
  24. 24. Summary ● Three hash index based data structures: ● Individually indexing ● Combined indexing ● Quad indexing ● Findings: ● A single index improves query performance significantly ● Smaller load times with quads ● Also for other use cases of ad hoc storing of Linked Data ● Consecutively retrieved from remote sources ● Used for immediate local processingOlaf Hartig - A Main Memory Index Structure to Query Linked Data 24
  25. 25. Backup SlidesOlaf Hartig - How Caching Improves Efficiency and Result Completeness for Querying Linked Data 25
  26. 26. Combined Indexing query-local dataset Logical representation Physical representation SP PO SO ● Idea: Use a single index for all descriptor objects Dict S P O ● src – maps each triple to a set of descriptor object IDs ● Implementation of the four operations requires: ● status – maps each descriptor object ID to a status ( BeingIndexed, Indexed, ToBeRemoved, BeingRemoved ) ● Find reports a triple t only if: ∃d ∈ src(t) : status(d) = IndexedOlaf Hartig - A Main Memory Index Structure to Query Linked Data 26
  27. 27. Implementation ● Available as Free Software (http://squin.org) ● Hash tables with n = 2m buckets ● Hash functions: ● hS( is, ip, io ) = is & bitmask[m] ● hSP( is, ip, io ) = ( is • ip ) & bitmask[m] ● etc. ● Which m ?* * see paper ● m = 4 for the individual indexes ● m = 12 for the combined index ● m = 12 for the quad indexOlaf Hartig - A Main Memory Index Structure to Query Linked Data 27
  28. 28. Experiment Setup ● Comparison without link traversal based query execution ● Compared data structures: ● Our implementation of IndIR, CombIR, CombQuadIR ● NamedGraphSetImpl in NG4J (Jena) ● DatasetGraphMap in ARQ (Jena) ● Berlin SPARQL Benchmark (BSBM) ● BSBM datasets partitioned into query-local datasets ● BSBM (v2.0) query mixes executed over these datasetsOlaf Hartig - A Main Memory Index Structure to Query Linked Data 28
  29. 29. Required Memory 120 ARQ NG4J IndIR (m=4) 100Estimated required memory in MB CombIR (m=12) CombQuad (m=12) 80 60 BSBM number of overall 40 scaling descriptor number factor objects of triples 50 2,599 22,616 100 4,178 40,133 20 150 5,756 57,524 200 7,329 75,062 0 250 9,873 97,613 0 100 200 300 400 500 300 11,455 115,217 pc (BSBM scaling factor) 350 13,954 137,567 500 18,687 190,502 Olaf Hartig - A Main Memory Index Structure to Query Linked Data 29
  30. 30. Execution Time 2500average execution time per query mix in seconds ARQ NG4J IndIR (m=4) 2000 CombIR (m=12) CombQuad (m=12) 1500 1000 BSBM number of overall scaling descriptor number factor objects of triples 50 2,599 22,616 500 100 4,178 40,133 150 5,756 57,524 200 7,329 75,062 0 250 9,873 97,613 0 100 200 300 400 500 300 11,455 115,217 pc (BSBM scaling factor) 350 13,954 137,567 500 18,687 190,502 Olaf Hartig - A Main Memory Index Structure to Query Linked Data 30
  31. 31. Load Time 10 CombIR (m=12) 9 IndIR (m=4) CombQuad (m=12) 8average creation time in seconds 7 6 5 4 BSBM number of overall scaling descriptor number 3 factor objects of triples 50 2,599 22,616 2 100 4,178 40,133 150 5,756 57,524 1 200 7,329 75,062 250 9,873 97,613 0 0 100 200 300 400 500 300 11,455 115,217 pc (BSBM scaling factor) 350 13,954 137,567 500 18,687 190,502 Olaf Hartig - A Main Memory Index Structure to Query Linked Data 31
  32. 32. Execution Time 80 60 NG4J (SWClLib) 70 50 IndIR, m=4 CombIR, m=12 CombIR, m=12 CombQuadIR CombQuadIR, m=12 60 40 , m=12execution time in seconds 50 30 40 20 30 10 20 0 0 5 10 15 20 25 10 query mix 0 0 20 40 60 80 100 120 140 160 180 200 query mix Olaf Hartig - A Main Memory Index Structure to Query Linked Data 32
  33. 33. These slides have been created by Olaf Hartig http://olafhartig.de This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/)Olaf Hartig - A Main Memory Index Structure to Query Linked Data 33

×