SQCFramework: SPARQL Query Containment Benchmark Generation Framework
Muhammad Saleem, Claus Stadler, Qaiser Mehmood, Jens Lehmann, Axel-Cyrille Ngonga Ngomo
(K-CAP 2017, Austin, USA)
AKSW, University of Leipzig, Germany
DICE, University of Paderborn, Germany
SDA, University of Bonn, Germany
Agenda:
 Query containment
 Why SQCFramework?
 SQCFramework
 Input queries
 Important query features
 Benchmark generation
 Benchmark personalization
 Evaluation and results
 Conclusion
Query containment: deciding whether the result set of one query is included in the result set of another.
Formally: Q1 is contained in Q2 (Q1 ⊑ Q2) iff Q1(D) ⊆ Q2(D) for every RDF dataset D.
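As a toy illustration (not the framework's method), the sketch below runs two SPARQL queries with rdflib on one example graph and checks the subset relation. Note that an empirical check on a single dataset can only refute containment; Q1 ⊑ Q2 must hold on every dataset.

# Toy containment witness: on this fixed graph, every result of Q1 is
# also a result of Q2, consistent with Q1 ⊑ Q2 (real containment is a
# statement over all datasets, not just this one).
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:alice a ex:Student ; ex:knows ex:bob .
ex:bob   a ex:Student .
""", format="turtle")

q1 = """SELECT ?s WHERE { ?s a <http://example.org/Student> ;
                              <http://example.org/knows> ?o . }"""
q2 = "SELECT ?s WHERE { ?s a <http://example.org/Student> . }"

r1 = {row.s for row in g.query(q1)}
r2 = {row.s for row in g.query(q2)}
assert r1 <= r2  # Q1's results are a subset of Q2's on this dataset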
Query containment is at the core of:
 Query optimization
 Caching mechanisms
 Data integration
 View maintenance
 Query rewriting
Why SQCFramework? It generates containment benchmarks that are:
 Based on real data
 Based on real log queries
 Flexible
 Customizable
 Use-case specific
[Framework overview: SPARQL queries + selection criteria → SQCFramework → containment benchmark, produced in four steps:]
1. Selection of super-queries
2. Normalization of feature vectors
3. Generation of clusters
4. Selection of most representative queries
 Manually provided by the user
 Selected from LSQ (Linked SPARQL Queries) datasets:
 Extracted from endpoint query logs (a parsing sketch follows below)
 Annotated with structural and data-driven statistics
20 datasets available from http://hobbitdata.informatik.uni-leipzig.de/lsq-dumps/
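For the raw-log case, a minimal sketch, assuming a plain-text endpoint access log where each request carries a percent-encoded query= parameter (LSQ dumps already ship parsed queries, so this only illustrates extraction):

# Pull SPARQL query strings out of a plain-text endpoint access log.
# The "query=..." format is an assumption about the log layout.
import re
from urllib.parse import unquote_plus

def extract_queries(log_path):
    pattern = re.compile(r"query=([^& ]+)")
    queries = []
    with open(log_path) as log:
        for line in log:
            m = pattern.search(line)
            if m:
                queries.append(unquote_plus(m.group(1)))
    return queries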
Query features used to characterize each query:
 Number of entailments/sub-queries
 Number of projection variables
 Number of BGPs
 Number of triple patterns
 Max. number of BGP triple patterns
 Min. number of BGP triple patterns
 Number of join vertices
 Mean join vertex degree
 Number of LSQ features
1. Selection of super-queries
2. Normalization of feature vectors
3. Generation of clusters
4. Selection of most representative queries
Example (step 2): a query's feature vector F is normalized by the per-feature maxima M observed over all queries, i.e., normalized value = F/M.

Feature                               F     M     F/M
Number of entailments/sub-queries     2     10    0.2
Number of projection variables        2     8     0.25
Number of BGPs                        1     6     0.16
Number of triple patterns             5     12    0.41
Max. number of BGP triple patterns    5     5     1
Min. number of BGP triple patterns    5     10    0.5
Number of join vertices               3     10    0.33
Mean join vertex degree               2.3   5     0.46
Number of LSQ features                2     30    0.06
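A minimal numpy sketch of this elementwise normalization, using the vectors from the table above (note the slide truncates some quotients, e.g. 1/6 ≈ 0.167):

# Elementwise normalization of one query's feature vector F by the
# per-feature maxima M over the whole query log (values from the table).
import numpy as np

F = np.array([2, 2, 1, 5, 5, 5, 3, 2.3, 2])
M = np.array([10, 8, 6, 12, 5, 10, 10, 5, 30])
print(F / M)  # cf. the F/M column above (the slide truncates some digits)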
Supported clustering and selection methods:
 FEASIBLE
 FEASIBLE-Exemplars
 KMeans++ (a clustering sketch follows this list)
 DBSCAN+KMeans++
 Agglomerative
 Random selection
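Of these, KMeans++ is the easiest to sketch with scikit-learn. The feature matrix below is randomly generated as a placeholder; FEASIBLE and FEASIBLE-Exemplars follow the FEASIBLE paper and are not reproduced here.

# Sketch: cluster normalized feature vectors with k-means++ seeding.
# One representative query per cluster later yields a benchmark of
# n_clusters queries.
import numpy as np
from sklearn.cluster import KMeans

vectors = np.random.rand(100, 9)  # 100 queries x 9 features (dummy data)
km = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(vectors)  # one cluster id per query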
Plot normalized feature vectors in a multidimensional space
Query F1 F2
Q1 0.2 0.2
Q2 0.5 0.3
Q3 0.8 0.3
Q4 0.9 0.1
Q5 0.5 0.5
Q6 0.2 0.7
Q7 0.1 0.8
Q8 0.13 0.65
Q9 0.9 0.5
Q10 0.1 0.5
Suppose we need a benchmark of 3 queries
[Scatter plot: the ten queries plotted by their (F1, F2) coordinates]
[Scatter plot: the same queries grouped into three clusters]
[Scatter plot: the three clusters with their averages (Avg.) marked]
Calculate the average of each cluster
[Scatter plot: distances from each query to its cluster average]
Calculate the distance of each point in a cluster to the cluster average
[Scatter plot: the minimum-distance query highlighted in each cluster]
Select the minimum-distance query as the final benchmark query from that cluster
(Q2, highlighted in purple, is the query selected from the yellow cluster)
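Putting steps 3 and 4 together on the toy data above, a sketch that averages each cluster and keeps the query nearest to that average (cluster membership may differ slightly from the figure, since it depends on the clustering run):

# Steps 3-4 on the slide's toy data: cluster the ten (F1, F2) vectors,
# average each cluster, and select the query closest to its average.
import numpy as np
from sklearn.cluster import KMeans

names = ["Q%d" % i for i in range(1, 11)]
X = np.array([[0.2, 0.2], [0.5, 0.3], [0.8, 0.3], [0.9, 0.1], [0.5, 0.5],
              [0.2, 0.7], [0.1, 0.8], [0.13, 0.65], [0.9, 0.5], [0.1, 0.5]])

labels = KMeans(n_clusters=3, init="k-means++", n_init=10,
                random_state=0).fit_predict(X)
benchmark = []
for c in range(3):
    members = np.where(labels == c)[0]
    avg = X[members].mean(axis=0)                     # cluster average
    dists = np.linalg.norm(X[members] - avg, axis=1)  # distance to average
    benchmark.append(names[members[dists.argmin()]])  # nearest query wins
print(sorted(benchmark))  # three representative queries, one per cluster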
Example personalization criteria (a filter sketch follows below):
 The number of projection variables in the super-queries should be at most 2
 The number of BGPs should be greater than 1, or the number of triple patterns should be greater than 3
 The benchmark should be selected from the 1000 most recently executed queries
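A minimal sketch of these criteria as a plain Python filter; the record field names (projection_vars, bgps, triple_patterns) are hypothetical, since SQCFramework itself accepts such selection criteria as input (cf. the "selection criteria" input in the framework overview).

# Hypothetical per-query feature records; field names are illustrative.
def matches(q):
    return (q["projection_vars"] <= 2
            and (q["bgps"] > 1 or q["triple_patterns"] > 3))

def personalize(log):
    recent = log[-1000:]  # the 1000 most recently executed queries
    return [q for q in recent if matches(q)]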
Benchmark quality metrics:
 Similarity error
 Diversity score
(L is the query log, B is the benchmark, and k is the set of all features; the exact formulas are given in the paper, and a sketch follows after this list)
 We compared
 FEASIBLE
 FEASIBLE-Exemplars
 KMeans++
 DBSCAN+KMeans++
 Random selection
 Number of containment tests (#T)
 Benchmark generation time (G) in sec
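The exact formulas over L, B, and k are in the paper; as a loose sketch under an assumed reading, similarity error compares the mean normalized feature vectors of benchmark and log, and diversity score compares per-feature standard deviations (consistent with the "Normalized S.D." comparison later in the deck):

# Hedged sketch only; the paper's exact formulas may differ.
# L and B are matrices of normalized feature vectors
# (rows = queries, columns = the k features).
import numpy as np

def similarity_error(L, B):
    # distance between the mean feature vectors of log and benchmark
    return np.abs(L.mean(axis=0) - B.mean(axis=0)).mean()

def diversity_score(L, B):
    # per-feature spread of the benchmark relative to the log
    return np.abs(L.std(axis=0) - B.std(axis=0)).mean()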
Containment solver evaluation metrics:
 Query Mixes per Hour (QMpH)
 Number of handled test cases
 Number of timed out test cases
 We compared
 TreeSolver
 AFMU
 SPARQL-Algebra
 JSAC
We generated benchmarks using the Semantic Web Dog Food (SWDF) and DBpedia query logs
[Line charts: similarity error vs. number of super-queries (SWDF: 15-125, DBpedia: 2-15) for FEASIBLE, FEASIBLE-Exemplars, KMeans++, DBSCAN+KMeans++, and Random]
• Similarity error is, in general, inversely proportional to benchmark size
• Random selection generally produces benchmarks with smaller similarity errors
[Line charts (SWDF and DBpedia): diversity score vs. number of super-queries for the same five selection methods]
• Diversity score is, in general, inversely proportional to benchmark size
• FEASIBLE-Exemplars generates the most diverse benchmarks
• No significant differences
[Bar chart: normalized standard deviation per query feature for SQCFrameWork-FEASIBLE-Exemplars vs. SQC-Benchmark*]
• SQCFrameWork-FEASIBLE-Exemplars is more diverse across the majority of the query features
*SQC-Benchmark: http://sparql-qc-bench.inrialpes.fr/
• JSAC correctly handled all test cases with reasonable QMpH
[Bar chart: QMpH for TreeSolver, AFMU, JSAC, and SPARQL-Algebra]
Solver           Total Tests   #Handled Tests   #Correct Tests   #Timeout Tests
TreeSolver       1192          5                5                2
AFMU             1192          5                5                12
SPARQL-Algebra   1192          0                0                0
JSAC             1192          1192             1192             0
 SQCFramework:
 Based on real data and real log queries
 Flexible
 Customizable
 Use-case specific
 Similarity error is, in general, inversely proportional to benchmark size
 Random selection generally produces benchmarks with smaller similarity errors
 Diversity score is, in general, inversely proportional to benchmark size
 FEASIBLE-Exemplars generates the most diverse benchmarks
 JSAC correctly handled all test cases with reasonable QMpH
 SQCFramework is available at https://github.com/dice-group/sqcframework
Thanks!
saleem@informatik.uni-leipzig.de