LargeRDFBench

LargeRDFBench: A billion triples benchmark for SPARQL endpoint federation, ISWC 2018, Semantic Web Journal

Published in: Science

  1. Muhammad Saleem, Ali Hasnain, Axel-Cyrille Ngonga Ngomo
      • AKSW, University of Leipzig, Germany
      • DICE, University of Paderborn, Germany
      • INSIGHT, University of Galway, Ireland
  2.
      • Federated benchmark design features
      • Why LargeRDFBench?
      • Evaluation and results
  3.
      • Datasets
      • Queries
      • Performance metrics
      • Execution rules
  4. Datasets used in the federation benchmark should vary in:
      • Number of triples
      • Number of classes
      • Number of resources
      • Number of properties
      • Number of objects
      • Average properties per class
      • Average instances per class
      • Average in-degree and out-degree
      • Structuredness or coherence
  5.
      • Number of triple patterns
      • Number of join vertices
      • Mean join vertex degree
      • Number of sources spanned
      • Query result set sizes
      • Mean triple pattern selectivity
      • BGP-restricted triple pattern selectivity
      • Join-restricted triple pattern selectivity
      • Join vertex types ('star', 'path', 'hybrid', 'sink')
      • SPARQL clauses used (e.g., LIMIT, UNION, OPTIONAL, FILTER)
  6.
      • Result set completeness and correctness
      • Number of sources selected
      • Number of SPARQL ASK requests used during source selection
      • Source selection time
      • Number of endpoint requests
      • Number of intermediate results
      • Overall query execution time
  7.
      • SPARQL query federation benchmark
      • 13 interconnected real datasets
          • 4 life sciences
          • 6 cross domain
          • 3 large data
      • 40 queries of varying complexity
          • 14 simple (from FedBench)
          • 10 complex
          • 8 large data
          • 8 complex plus high data sources
      • Multiple performance metrics
  8. Why LargeRDFBench?
  9. [Figure: Structuredness; series: FedBench, FedBench-Mean, LargeRDFBench, LargeRDFBench-Mean]
      • FedBench: Min. = 0.19, Max. = 0.91, STD. = ± 0.26
      • LargeRDFBench: Min. = 0.19, Max. = 1, STD. = ± 0.28
  10. [Figure: #Results (log scale) per query, S1 to S14, C1 to C10, L1 to L8, CH1 to CH8; series: FedBench, FedBench-Mean, LargeRDFBench, LargeRDFBench-Mean]
      • FedBench: Min. = 1, Max. = 9054, STD. = ± 2397
      • LargeRDFBench: Min. = 1, Max. = 306705, STD. = ± 104236
  11.
      • FedBench
          • FedX = 7.4 sec
          • SPLENDID = 53.4 sec
          • ANAPSID = 12.4 sec
          • SemaGrow = 12 sec
          • CostFed = 0.44 sec
      • LargeRDFBench (complex queries)
          • FedX = 246 sec
          • SPLENDID = 212 sec
          • ANAPSID = 147 sec
          • SemaGrow = 367 sec
          • CostFed = 122 sec
      • LargeRDFBench (LargeData queries)
          • Greater than 1 hour for all engines
  12. [Figure: #TriplePatterns per query, S1 to S14, C1 to C10, L1 to L8, CH1 to CH8; series: FedBench, FedBench-Mean, LargeRDFBench, LargeRDFBench-Mean]
      • FedBench: Min. = 2, Max. = 7, STD. = ± 1.33
      • LargeRDFBench: Min. = 2, Max. = 33, STD. = ± 6.15
  13. [Figure: #JoinVertices per query, S1 to S14, C1 to C10, L1 to L8, CH1 to CH8; series: FedBench, FedBench-Mean, LargeRDFBench, LargeRDFBench-Mean]
      • FedBench: Min. = 0, Max. = 5, STD. = ± 1.33
      • LargeRDFBench: Min. = 0, Max. = 19, STD. = ± 3.63
  14. [Figure: MeanJoinVerticesDegree per query, S1 to S14, C1 to C10, L1 to L8, CH1 to CH8; series: FedBench, FedBench-Mean, LargeRDFBench, LargeRDFBench-Mean]
      • FedBench: Min. = 2, Max. = 3, STD. = ± 0.3
      • LargeRDFBench: Min. = 2, Max. = 6, STD. = ± 0.72
  15. [Figure: #RelevantSources per query, S1 to S14, C1 to C10, L1 to L8, CH1 to CH8; series: FedBench, FedBench-Mean, LargeRDFBench, LargeRDFBench-Mean]
      • FedBench: Min. = 1, Max. = 4, STD. = ± 0.66
      • LargeRDFBench: Min. = 1, Max. = 10, STD. = ± 2.1
  16. [Figure: MeanTriplePatternSelectivity per query, S1 to S14, C1 to C10, L1 to L8, C…; series: FedBench, LargeRDFBench]
      • FedBench: Min. = 0.0011, Max. = 0.3335, STD. = ± 0.11
      • LargeRDFBench: Min. = 0.00004, Max. = 0.4858, STD. = ± 0.13
  17. [Figure: MeanBGP-restrictedTriplePatternSelectivity per query, S1 to S14, C1 to C10, L1 to L8, CH1 to CH8; series: FedBench, LargeRDFBench]
      • FedBench: Min. = 0.0003, Max. = 1, STD. = ± 0.31
      • LargeRDFBench: Min. = 0.0003, Max. = 1, STD. = ± 0.22
  18. [Figure: MeanJoin-restrictedTriplePatternSelectivity (log scale) per query, S1 to S14, C1 to C10, L1 to L8, CH1 to CH8; series: FedBench, LargeRDFBench]
      • FedBench: Min. = 0.0, Max. = 0.33, STD. = ± 0.13
      • LargeRDFBench: Min. = 0.0, Max. = 0.58, STD. = ± 0.15
  19. [Figure: Avg. time (msec, log scale) for queries S1 to S14; series: FedX(cold), FedX(100% cached), SPLENDID, ANAPSID, FedX+HiBISCuS, SPLENDID+HiBISCuS]
  20. [Figure: Avg. time (msec, log scale) for queries C1 to C10; series: FedX(cold), FedX(100% cached), SPLENDID, ANAPSID, FedX+HiBISCuS, SPLENDID+HiBISCuS; several bars are marked "Runtime error" or "Timeout"]
  21.
      Query | FedX(100% cached) | SPLENDID | ANAPSID | FedX+HiBISCuS | SPLENDID+HiBISCuS
      L1 | 2320 (7.2 %) | 16 (2.73 %) | 1947 (15.76 %) | 2320 (7.2 %) | 16 (2.73 %)
      L2 | 1 (0 %) | 80 (1.8 %) | 1609 (ZRT) | 1 (0 %) | 80 (1.80 %)
      L3 | 1 (0 %) | 2734572 | 5553 (ZRT) | 1 (0 %) | 2734572
      L4 | 3967 (0.08 %) | 16321 (0 %) | 11290 (0 %) | 15721 (48.34 %) | 16220 (0 %)
      L5 | 1 (0 %) | 28342 (ZRE) | 3840 (ZRT) | 1 (0 %) | 28212 (ZRE)
      L6 | 3830809 (ZRT) | 61810 (ZRE) | 1707 (ZRT) | 3414400 (0 %) | 61419 (ZRE)
      L7 | 74387 | 867628 | 267 | 23381 | 1341384
      L8 | 206859 (0.01 %) | 2423783 (0.05 %) | 17302 (0.05 %) | 206859 (0.01 %) | 2423783 (0.05 %)
      ZRT = Zero Results after Timeout; ZRE = Zero Results with Runtime Error
  22.
      Query | CostFed | FedX | SPLENDID | ANAPSID | FedX+HiBISCuS | SPLENDID+HiBISCuS | SemaGrow
      CH1 | 800 | 6321 | Runtime Error | Timeout | 4109 | Runtime Error | 277435
      CH2 | Runtime Error | 3282233 | Runtime Error | 101517 | 778563 | Runtime Error | 144326
      CH3 | Zero Results | 275412 | Runtime Error | Zero Results | 329223 | Runtime Error | 4660
      CH4 | Runtime Error | Zero Results | Runtime Error | 27544 | Zero Results | Runtime Error | 10606
      CH5 | 52875 | Timeout | 15122 | Parse Error | Runtime Error | 15122 | Timeout
      CH6 | 173 | 274 | Runtime Error | 7737 | Runtime Error | Runtime Error | Timeout
      CH7 | 5647 | Timeout | Runtime Error | 2480 | Runtime Error | Runtime Error | Timeout
      CH8 | Timeout | Timeout | Runtime Error | Timeout | Runtime Error | Runtime Error | Timeout
      Most engines are somewhat unstable
  23.
      Feature | FedX(Warm) | SPLENDID | ANAPSID | FedX+HiBISCuS | SPLENDID+HiBISCuS
      #Triple Patterns | 0.537, 0.453, 0.621, 0.492 [only four values survive; column alignment unknown]
      #Sources Span | 0.233 | 0.232 | 0.245 | 0.019 | 0.290
      #Results | 0.583 | 0.553 | 0.085 | 0.534 | 0.476
      #Join Vertices | 0.275 | 0.289 | 0.214 | 0.301 | 0.284
      Mean Join Vertex Degree | 0.500 | 0.210 | 0.226 | 0.382 | 0.183
      Mean TP Selectivity | 0.261 | 0.304 | 0.198 | 0.237 | 0.263
      Mean BGP-restricted TP Sel | -0.065 | -0.022 | -0.190 | -0.014 | -0.042
      Mean Join-restricted TP Sel | 0.654 | -0.334 | -0.224 | -0.472 | -0.441
      • Results are significant at 1% level
      • Results are significant at 5% level
      • Results are significant at 10% level
      Most influential features:
      • Number of triple patterns
      • Result size
      • Join-restricted TP selectivity
      • Number of join vertices
      • Mean join vertex degree
      • Mean TP selectivity
      • Number of sources span
      • BGP-restricted TP selectivity
  24.
      • Benchmarks that contain only simple queries are not sufficient for a fair comparison of federation engines
      • The ranking of federation engines changes greatly from simple to complex queries
      • Federation engines are unstable when exposed to Large Data or Complex + High Sources queries
      • Number of triple patterns, result size, and join-restricted TP selectivity are the three most influential query features
      • A smaller number of endpoint requests does not necessarily mean a shorter execution time
  25. This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
  26. Thanks! saleem@informatik.uni-leipzig.de
  28. SPARQL ENDPOINTS QUERY FEDERATION
      [Diagram: a federation engine in front of four SPARQL endpoints (Endpoint 1 to Endpoint 4), each serving RDF data]
      • Parsing: rewrite the query and get the individual triple patterns
      • Source selection: identify capable sources for the individual triple patterns
      • Federator optimizer: generate an optimized sub-query execution plan
      • Integrator: integrate the sub-query results
      • Execute sub-queries
  29. QUERIES AS HYPERGRAPHS
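
The join-vertex features listed on slide 5 (number of join vertices, mean join vertex degree, join vertex types) can be made concrete by treating a basic graph pattern as a graph over its terms, in the spirit of the hypergraph view named on slide 29. The sketch below is a minimal illustration under stated assumptions, not the benchmark's own implementation: only variables are counted as vertices, a subject occurrence counts as an outgoing edge, a predicate or object occurrence counts as an incoming edge, and the star/sink/path/hybrid rules are one common reading of those terms. The function names and the prefixed terms in the example are hypothetical.

```python
from collections import defaultdict


def classify_join_vertices(triple_patterns):
    """Return {variable: (degree, kind)} for every join vertex in a BGP.

    Assumption: only variables ("?x") are vertices; subject positions are
    outgoing edges, predicate and object positions are incoming edges.
    """
    out_deg = defaultdict(int)  # occurrences as subject
    in_deg = defaultdict(int)   # occurrences as predicate or object
    for s, p, o in triple_patterns:
        if s.startswith("?"):
            out_deg[s] += 1
        for term in (p, o):
            if term.startswith("?"):
                in_deg[term] += 1

    result = {}
    for var in set(out_deg) | set(in_deg):
        incoming, outgoing = in_deg[var], out_deg[var]
        if incoming + outgoing < 2:
            continue  # appears in one pattern only: not a join vertex
        if incoming == 0:
            kind = "star"    # only outgoing edges
        elif outgoing == 0:
            kind = "sink"    # only incoming edges
        elif incoming == 1 and outgoing == 1:
            kind = "path"    # exactly one in, one out
        else:
            kind = "hybrid"  # mixed, with more than one edge on a side
        result[var] = (incoming + outgoing, kind)
    return result


def mean_join_vertex_degree(join_vertices):
    """Average degree over all join vertices (0.0 if there are none)."""
    if not join_vertices:
        return 0.0
    return sum(deg for deg, _ in join_vertices.values()) / len(join_vertices)


# Hypothetical four-pattern BGP; the prefixes are illustrative only.
bgp = [
    ("?drug", "rdf:type", "drugbank:Drug"),
    ("?drug", "drugbank:keggCompoundId", "?cpd"),
    ("?cpd", "kegg:mass", "?mass"),
    ("?drug2", "owl:sameAs", "?cpd"),
]
join_vertices = classify_join_vertices(bgp)
print(len(bgp), len(join_vertices), mean_join_vertex_degree(join_vertices))
# prints: 4 2 2.5
```

Here `?drug` is a star vertex of degree 2 and `?cpd` is a hybrid vertex of degree 3, while `?mass` and `?drug2` each occur once and are not join vertices.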
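
The source selection step on slide 28, and two of the metrics on slide 6 (number of sources selected, number of SPARQL ASK requests), can be illustrated with in-memory stand-ins for endpoints. This is a rough sketch under stated assumptions rather than how any of the evaluated engines works: `MockEndpoint`, `select_sources`, and the sample triples are hypothetical, no network is involved, and one ASK is issued per triple pattern per endpoint, whereas real engines may use indexes or caches to avoid some of those requests.

```python
def matches(pattern, triple):
    """A triple matches a pattern if every non-variable term is equal."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))


class MockEndpoint:
    """In-memory stand-in for a SPARQL endpoint (hypothetical, no HTTP)."""

    def __init__(self, name, triples):
        self.name = name
        self.triples = triples
        self.ask_requests = 0  # how many ASK requests this endpoint received

    def ask(self, pattern):
        """Emulate 'ASK { pattern }': is there at least one matching triple?"""
        self.ask_requests += 1
        return any(matches(pattern, t) for t in self.triples)


def select_sources(triple_patterns, endpoints):
    """Map each triple pattern to the names of endpoints that answer true."""
    return {
        tp: [ep.name for ep in endpoints if ep.ask(tp)]
        for tp in triple_patterns
    }


# Hypothetical data spread over two endpoints.
endpoints = [
    MockEndpoint("drugbank", [
        ("db:D1", "db:name", "Aspirin"),
        ("db:D1", "owl:sameAs", "kegg:C1"),
    ]),
    MockEndpoint("kegg", [
        ("kegg:C1", "kegg:mass", "180.16"),
    ]),
]
patterns = [
    ("?d", "db:name", "?n"),
    ("?d", "owl:sameAs", "?c"),
    ("?c", "kegg:mass", "?m"),
]

selected = select_sources(patterns, endpoints)
total_sources = sum(len(names) for names in selected.values())
total_asks = sum(ep.ask_requests for ep in endpoints)
print(total_sources, total_asks)
# prints: 3 6
```

With three patterns and two endpoints the naive strategy costs six ASK requests to select three sources in total, which is why the two numbers are reported as separate metrics.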
