DAW: Duplicate-AWare Federated Query Processing over the Web of Data

DAW: Duplicate-AWare Federated Query Processing over the Web of Data, presented in the ISWC 2013 research track.



  1. DAW: Duplicate-AWare Federated Query Processing over the Web of Data
     Muhammad Saleem¹, Axel-Cyrille Ngonga Ngomo¹, Josiane Xavier Parreira², Helena F. Deus³, Manfred Hauswirth²
     ¹ Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig, Germany, lastname@informatik.uni-leipzig.de
     ² Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, firstname.lastname@deri.org
     International Semantic Web Conference (ISWC), October 21-25, 2013, Sydney, Australia
  2. Motivation
     Architecture of a SPARQL endpoint federation engine over four RDF sources (S1, S2, S3, S4):
       • Parser: get the individual triple patterns
       • Source Selection: identify the capable sources for each individual triple pattern
       • Federator / Optimizer: generate an optimized sub-query execution plan
       • Execute the sub-queries
       • Integrator: integrate the sub-query results
  3. Motivation
     Example query:
       SELECT ?v1 ?v2 WHERE {
         ?uri <p1> ?v1 .   # Triple Pattern 1 (TP1)
         ?uri <p2> ?v2 .   # Triple Pattern 2 (TP2)
       }
     Triple pattern-wise source selection: the source selection algorithm checks the RDF sources S1, S2, S3, S4 for each triple pattern.
       • TP1 = {S1, S2, S3}
       • TP2 = {S1, S2, S4}
       • Total triple pattern-wise selected sources = 6
  4. Motivation
     Triple pattern-wise source selection and skipping, based on the results retrieved for TP1 (?uri <p1> ?v1) and TP2 (?uri <p2> ?v2):
       • Candidate sources: TP1 = {S1, S2, S3}, TP2 = {S1, S2, S4}
       • Min. number of new triples (threshold) = 20
       • Total triple pattern-wise selected sources = 4
       • Total triple pattern-wise skipped sources = 2
  5. Problem Statement
       • Data is duplicated across LOD datasets
         – E.g., DrugBank and Neurocommons are duplicated at the DERI Health Care and Life Sciences Knowledge Base
       • Retrieving duplicate results increases query execution time and network traffic
       • How can the overlap between data sources be estimated before the sub-queries are federated?
  6. Sketches
       • Data structures that provide dataset summaries
         – Min-wise Independent Permutations (MIPs)
         – Bloom filters
       • They estimate the overlap among different ID sets
       • MIPs provide a good trade-off between estimation error and space requirements
       • MIPs of different lengths can be compared
       • Sketches alone cannot be used in SPARQL federation
         – A triple pattern becomes highly selective once its subject, predicate, or object is bound
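  7. Min-wise Independent Permutations
       • Each triple T(s,p,o) is mapped to an integer ID via h(concat(s,o)); the IDs of the triples form the ID set.
       • N permutations of the form $h_i(x) = (a_i x + b_i) \bmod U$ are applied to all IDs, e.g. $h_1 = (7x + 3) \bmod 51$, $h_2 = (5x + 6) \bmod 51$, ..., $h_N = (3x + 9) \bmod 51$.
       • The MIP vector is created from the minima of the permutations (8, 9, ..., 9 in the slide's example).
     MIPs estimated operations, for ID sets $S_A$, $S_B$ with MIP vectors $V_A$, $V_B$ of length $N$:
       $\mathrm{Resemblance}(S_A, S_B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} \approx \frac{|V_A \cap V_B|}{N}$
       $\mathrm{Overlap}(S_A, S_B) \approx \frac{\mathrm{Resemblance}(V_A, V_B) \times (|S_A| + |S_B|)}{\mathrm{Resemblance}(V_A, V_B) + 1}$
       Error estimation $= O(1/\sqrt{N})$
     Example with $V_A = (8, 9, 30, 24, 36, 9)$ and $V_B = (8, 24, 20, 48, 36, 13)$:
       • Union$(V_A, V_B) = (8, 9, 20, 24, 36, 9)$
       • Resemblance$(V_A, V_B) = 2/6 \approx 0.33$
       • Overlap$(V_A, V_B) = 0.33 \times (6 + 6) / (1 + 0.33) \approx 3$
     Size estimate of source $i$ for a triple pattern with predicate $p$:
       $S'_i = S_i$ if neither subject nor object is bound
       $S'_i = S_i \times \mathrm{avgSbjSel}(S_p)$ if the subject is bound
       $S'_i = S_i \times \mathrm{avgObjSel}(S_p)$ if the object is bound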
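The MIP operations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the permutation coefficients in the usage test are the ones shown on the slide, and the vectors `VA` and `VB` are the slide's example values. Comparing only the first `min(len)` positions is one simple way to handle MIPs of different lengths and is an assumption here.

```python
from typing import List, Sequence, Tuple


def mip_vector(ids: Sequence[int], perms: Sequence[Tuple[int, int]], modulus: int) -> List[int]:
    """Apply each permutation h_i(x) = (a_i * x + b_i) mod U to all IDs and keep the minimum."""
    return [min((a * x + b) % modulus for x in ids) for a, b in perms]


def resemblance(va: Sequence[int], vb: Sequence[int]) -> float:
    """Fraction of positions at which both MIP vectors hold the same minimum."""
    n = min(len(va), len(vb))  # assumption: compare the common prefix only
    return sum(1 for i in range(n) if va[i] == vb[i]) / n


def overlap(va: Sequence[int], vb: Sequence[int], size_a: float, size_b: float) -> float:
    """Estimated number of shared IDs: R * (|S_A| + |S_B|) / (R + 1)."""
    r = resemblance(va, vb)
    return r * (size_a + size_b) / (r + 1)


def union(va: Sequence[int], vb: Sequence[int]) -> List[int]:
    """MIP vector of the union of two ID sets: position-wise minimum."""
    return [min(x, y) for x, y in zip(va, vb)]


# Example vectors from the slide
VA = [8, 9, 30, 24, 36, 9]
VB = [8, 24, 20, 48, 36, 13]

print(union(VA, VB))
print(resemblance(VA, VB))
print(overlap(VA, VB, 6, 6))
```

With the slide's vectors, two of six positions match, which gives the resemblance of 2/6 and the estimated overlap of about 3 quoted above.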
  8. DAW
       • A combination of MIPs with compact data summaries
       • Uses average selectivity values for bound subjects and objects
       • Can be combined with any existing SPARQL endpoint federation system
       • Can be used for partial result retrieval
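How the average selectivities enter the size estimate can be sketched as follows. The function name and signature are hypothetical; the three cases follow the piecewise rule on the MIPs slide, and the numbers in the usage lines are taken from the `diseasome:name` entry of the DAW index. The slides do not say what happens when both subject and object are bound; this sketch applies both factors, which is purely an assumption.

```python
def adjusted_size(total_triples: int,
                  avg_sbj_sel: float,
                  avg_obj_sel: float,
                  subject_bound: bool,
                  object_bound: bool) -> float:
    """Scale a predicate's triple count by the average selectivity of each bound position."""
    size = float(total_triples)
    if subject_bound:
        size *= avg_sbj_sel  # bound subject: S_i * avgSbjSel
    if object_bound:
        size *= avg_obj_sel  # bound object: S_i * avgObjSel (both bound: assumption)
    return size


# diseasome:name has 147 triples, avgSbjSel 0.0068, avgObjSel 0.0069
unbound = adjusted_size(147, 0.0068, 0.0069, subject_bound=False, object_bound=False)
with_subject = adjusted_size(147, 0.0068, 0.0069, subject_bound=True, object_bound=False)
print(unbound, with_subject)
```

For a bound subject the estimate drops from 147 triples to roughly one, which is why a sketch over the whole predicate cannot be used unadjusted.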
  9. DAW Index
       [] a sd:Service ;
          sd:endpointUrl <http://localhost:8890/sparql> ;
          sd:capability [
             sd:predicate diseasome:name ;
             sd:totalTriples 147 ;
             sd:avgSbjSel "0.0068" ;
             sd:avgObjSel "0.0069" ;
             sd:MIPs "-6908232 -7090543 -6892373 -7064247 ..." ;
          ] ;
          sd:capability [
             sd:predicate diseasome:chromosomalLocation ;
             sd:totalTriples 160 ;
             sd:avgSbjSel "0.0062" ;
             sd:avgObjSel "0.0072" ;
             sd:MIPs "-7056448 -7056410 -6845713 -6966021 ..." ;
          ] ;
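One way to hold such an index entry in memory is sketched below. The `Capability` class and `parse_mips` helper are hypothetical names, not part of DAW. The MIPs literal is truncated on the slide, so the example uses only the four values that are visible.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Capability:
    """One sd:capability block of the index: per-predicate statistics of an endpoint."""
    predicate: str
    total_triples: int
    avg_sbj_sel: float
    avg_obj_sel: float
    mips: List[int]


def parse_mips(literal: str) -> List[int]:
    """Split the whitespace-separated sd:MIPs literal into integers."""
    return [int(token) for token in literal.split()]


# Only the four MIP values visible on the slide; the real vector is longer.
name_cap = Capability(
    predicate="diseasome:name",
    total_triples=147,
    avg_sbj_sel=float("0.0068"),
    avg_obj_sel=float("0.0069"),
    mips=parse_mips("-6908232 -7090543 -6892373 -7064247"),
)
print(name_cap)
```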
  10. Triple Pattern-wise source ranking and skipping
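The slide itself carries only a title, so the following is a sketch of how ranking and skipping could work for a single triple pattern, built from the MIP operations and the threshold on the earlier slides. It is an assumed greedy procedure, not the authors' algorithm: start with the largest source, repeatedly pick the source that contributes the most estimated new triples, and skip the rest once that contribution falls below the threshold. All names and the toy MIP vectors are hypothetical.

```python
from typing import Dict, List, Tuple

Source = Tuple[List[int], float]  # (MIP vector, estimated result size)


def resemblance(va: List[int], vb: List[int]) -> float:
    n = min(len(va), len(vb))
    return sum(1 for i in range(n) if va[i] == vb[i]) / n


def estimate_overlap(va: List[int], vb: List[int], size_a: float, size_b: float) -> float:
    r = resemblance(va, vb)
    return r * (size_a + size_b) / (r + 1)


def union(va: List[int], vb: List[int]) -> List[int]:
    return [min(x, y) for x, y in zip(va, vb)]


def rank_and_skip(sources: Dict[str, Source], threshold: float) -> Tuple[List[str], List[str]]:
    """Return (ranked sources to query, skipped sources) for one triple pattern."""
    remaining = dict(sources)
    first = max(remaining, key=lambda name: remaining[name][1])  # largest source first
    first_mips, union_size = remaining.pop(first)
    union_mips = list(first_mips)
    ranked, skipped = [first], []
    while remaining:
        # Estimated triples each candidate would add to what is already selected
        new_triples = {
            name: size - estimate_overlap(union_mips, mips, union_size, size)
            for name, (mips, size) in remaining.items()
        }
        best = max(new_triples, key=new_triples.get)
        if new_triples[best] < threshold:
            skipped.extend(remaining)  # every other candidate adds even less
            break
        mips, _ = remaining.pop(best)
        ranked.append(best)
        union_mips = union(union_mips, mips)
        union_size += new_triples[best]
    return ranked, skipped


# Toy input: S2 duplicates S1, S3 holds different data
sources = {
    "S1": ([8, 9, 30, 24, 36, 9], 100.0),
    "S2": ([8, 9, 30, 24, 36, 9], 100.0),
    "S3": ([5, 12, 31, 2, 40, 11], 50.0),
}
ranked, skipped = rank_and_skip(sources, threshold=20)
print(ranked, skipped)
```

In the toy input S2 is an exact duplicate of S1, so its estimated contribution is zero and it is skipped under the threshold of 20 used on the Motivation slide, while S3 is kept.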
  11. Evaluation Setup
      Datasets:
        Dataset     | Total Size (MB) | Index Size (bytes) | No of Slice | Discrepancy | No of Dup. Slices | Index Gen. Time (sec)
        Diseasome   | 18.62           | 0.17               | 10          | 1500        | 1                 | 4
        Geo         | 274.14          | 1.63               | 10          | 50000       | 2                 | 133
        LinkedMDB   | 448.93          | 1.66               | 10          | 100000      | 1                 | 201
        Publication | 39.07           | 0.2                | 10          | 2500        | 1                 | 6
      Queries distribution:
        Dataset     | STP | S-1 | S-2 | P-1 | P-2 | P-3 | Total
        Diseasome   | 5   | 5   | 5   | 4   | 5   | 2   | 26
        Geo         | 5   | 5   | 5   | -   | -   | -   | 15
        LinkedMDB   | 5   | -   | -   | -   | -   | -   | 5
        Publication | 5   | 5   | 5   | 7   | 7   | 4   | 33
        Total       | 20  | 15  | 15  | 11  | 12  | 6   | 79
      Endpoints:
        EndPoint | CPU (GHz) | RAM  | Hard Disk
        1        | 2.2, i3   | 4GB  | 300GB
        2        | 2.9, i7   | 16GB | 256GB SSD
        3        | 2.6, i5   | 4GB  | 150GB
        4        | 2.53, i5  | 4GB  | 300GB
        5        | 2.3, i5   | 4GB  | 500GB
        6        | 2.53, i5  | 4GB  | 300GB
        7        | 2.9, i7   | 8GB  | 450GB
        8        | 2.6, i5   | 8GB  | 400GB
        9        | 2.6, i5   | 8GB  | 400GB
        10       | 2.9, i7   | 16GB | 500GB
      • Slice generator tool [1] for random slicing and duplicates
      • We have extended FedX, SPLENDID, and DARQ with DAW
      [1] http://goo.gl/trjGSJ
  12. Triple Pattern-wise sources skipped
      DARQ:
        Dataset     | STP     | S-1     | S-2     | P-1     | P-2     | P-3     | Total     | Recall
        Diseasome   | 14(35)  | 30(77)  | 40(107) | 35(65)  | 65(125) | 30(50)  | 214(459)  | 100%
        Geo         | 22(40)  | 23(55)  | 37(101) | -       | -       | -       | 82(196)   | 99.99%
        LinkedMDB   | 22(38)  | -       | -       | -       | -       | -       | 22(38)    | 100%
        Publication | 9(30)   | 10(37)  | 15(86)  | 14(60)  | 21(120) | 32(102) | 101(435)  | 100%
        Total       | 67(143) | 63(169) | 92(294) | 49(125) | 86(245) | 62(152) | 419(1128) |
      FedX and SPLENDID:
        Dataset     | STP     | S-1     | S-2     | P-1     | P-2     | P-3     | Total     | Recall
        Diseasome   | 7(28)   | 30(77)  | 40(107) | 35(65)  | 65(125) | 30(50)  | 207(452)  | 100%
        Geo         | 19(37)  | 23(55)  | 37(101) | -       | -       | -       | 79(193)   | 99.99%
        LinkedMDB   | 15(31)  | -       | -       | -       | -       | -       | 15(31)    | 100%
        Publication | 3(24)   | 10(37)  | 15(86)  | 14(60)  | 21(120) | 32(102) | 95(429)   | 100%
        Total       | 44(120) | 63(169) | 92(294) | 49(125) | 86(245) | 62(152) | 396(1105) |
  14. FedX Extension with DAW
      [Bar chart: execution time (sec) of FedX vs. DAW per query type; STP, S-1, S-2, P-1, P-2, P-3 for Diseasome and Publication, STP, S-1, S-2 for Geo Data, STP for Movie]
      Overall performance evaluation:
        Dataset     | FedX Average | DAW Average | Gain %
        Diseasome   | 2.44         | 1.98        | 18.79
        Publication | 1.48         | 1.67        | -12.38
        Geo Data    | 4.60         | 3.92        | 14.71
        Movie       | 1.74         | 1.61        | 7.59
        Overall     | 2.44         | 2.20        | 9.76
  15. SPLENDID Extension with DAW
      [Bar chart: execution time (sec) of SPLENDID vs. DAW per query type; STP, S-1, S-2, P-1, P-2, P-3 for Diseasome and Publication, STP, S-1, S-2 for Geo, STP for Movie]
      Overall performance evaluation:
        Dataset     | SPLENDID Average | DAW Average | Gain %
        Diseasome   | 3.78             | 3.04        | 19.48
        Publication | 2.18             | 2.37        | -8.94
        Geo Data    | 7.27             | 6.22        | 14.40
        Movie       | 1.9              | 1.688       | 11.16
        Overall     | 3.71             | 3.30        | 11.11
  16. DARQ Extension with DAW
      [Bar chart: execution time (sec) of DARQ vs. DAW per query type; STP, S-1, S-2, P-1, P-2, P-3 for Diseasome and Publication, STP, S-1, S-2 for Geo, STP for Movie]
      Overall performance evaluation:
        Dataset     | DARQ Average | DAW Average | Gain %
        Diseasome   | 8.27         | 6.34        | 23.34
        Publication | 5.26         | 4.94        | 6.14
        Geo Data    | 23.44        | 19.62       | 16.31
        Movie       | 1.96         | 1.688       | 13.88
        Overall     | 9.59         | 8.01        | 16.46
  17. Source Ranking vs Recall
      [Two line plots, one for Diseasome and one for Publication: recall in % against ranked sources, comparing Optimal and DAW]
  18. Conclusion and Future Work
       • A sub-query can retrieve results that have already been retrieved by another sub-query
         – Resources are wasted
         – Query runtime is increased
         – Extra traffic is generated
       • Sketches alone cannot be used, due to the expressive nature of SPARQL queries
       • We used MIPs applied to RDF predicates, along with compact data summaries
       • Performance improvement
         – FedX: 9.76 %
         – SPLENDID: 11.11 %
         – DARQ: 16.46 %
       • Future work: explore the effect of MIP sizes and threshold values to find the optimal trade-off between execution time and recall
      saleem@informatik.uni-leipzig.de, AKSW, University of Leipzig, Germany
