
Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

  1. Predicting SPARQL Query Execution Time and Suggesting SPARQL Queries Based on Query History
     Rakebul Hasan
  2. Context
     • Assisting human users and software agents in:
       – Querying Semantic Web data
         • Understanding query behavior: predicting query performance
           – Workload management, query scheduling, query optimization
         • Constructing and refining queries: suggesting alternatives
       – Consuming Semantic Web data
         • Understanding reasoning of Semantic Web software agents: explaining reasoning
           – Transparency, trust, scrutability, decision effectiveness, decision efficiency, user satisfaction
  3. Outline
     • Predicting SPARQL query execution time
     • Suggesting similar SPARQL queries from query history
  4. PREDICTING SPARQL QUERY EXECUTION TIME
  5. • Accurately predicting query performance enables effective
       – workload management
       – query scheduling
       – query optimization
  6. Understanding performance of computer programs
     Insight. [Knuth] Use scientific method to understand performance
  7. Scientific method applied to analysis of algorithms
     • A framework for predicting performance and comparing algorithms.
     • Scientific method
       – Observe some feature of the natural world.
       – Hypothesize a model that is consistent with the observations.
       – Predict events using the hypothesis.
       – Verify the predictions by making further observations.
       – Validate by repeating until the hypothesis and observations agree.
     • Principles
       – Experiments must be reproducible.
       – Hypotheses must be falsifiable.
     • Feature of the natural world. Computer itself.
     Slide credit: Robert Sedgewick
  8. Example: 3-Sum
     • 3-SUM. Given N distinct integers, how many triples sum to exactly zero?
     • 3-SUM brute-force algorithm. Check all the possible triples.
     • How much time does it take?
     Slide credit: Robert Sedgewick
  9. Data analysis
     • Standard plot. Plot running time T(N) vs. input size N.
     Slide credit: Robert Sedgewick
  10. Data analysis
     • Log-log plot. Plot running time lg(T(N)) vs. input size lg N.
     • Regression. Fit straight line through data points: $a N^b$.
     • Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$ seconds.
     Slide credit: Robert Sedgewick
  11. Prediction and validation
     • Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$ seconds.
     • Predictions.
       – 51.0 seconds for N = 8000.
       – 408.1 seconds for N = 16000.
     • Observations. Validates the hypothesis.
     Slide credit: Robert Sedgewick
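The log-log regression and the two predictions on slides 10 and 11 can be sketched as follows. This is a minimal illustration, not the code behind the slides: the function names `fit_power_law` and `predict_time` are assumptions, and the fit is an ordinary least-squares line through (lg N, lg T).

```python
import math

def fit_power_law(sizes, times):
    """Fit T = a * N**b by least squares on (lg N, lg T); return (a, b)."""
    xs = [math.log2(n) for n in sizes]
    ys = [math.log2(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                    # slope of the straight line
    a = 2 ** (mean_y - b * mean_x)   # intercept, mapped back from lg scale
    return a, b

def predict_time(a, b, n):
    """Evaluate the hypothesis T(N) = a * N**b."""
    return a * n ** b

# Constants taken from the hypothesis on the slide.
A, B = 1.006e-10, 2.999

for n in (8000, 16000):
    print(n, round(predict_time(A, B, n), 1))
```

With the slide's constants, the two printed values round to the 51.0 and 408.1 seconds quoted above.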
  12. Understanding performance of database queries
     • Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning.
     • Gupta et al. use machine learning for predicting query execution time ranges.
     Ganapathi et al.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Proc. of the 2009 IEEE ICDE
     Gupta et al.: PQR: Predicting query execution times for autonomous workload management. In Proc. of the 2008 ICAC
  13. Predicting SPARQL query execution time
     • Key challenge. Feature engineering
       – Representing SPARQL queries as feature vectors
         • Each dimension of the vector is a feature
  14. Configuration
     • Apache Jena TDB
       – With DBpedia 3.8 dataset
     • Training, validation, and test queries: randomly selected from DBpedia SPARQL Benchmark (DBPSB) query dataset
       – 3600 training, 1200 validation, 1200 test
  15. Jena ARQ query processing
     • A SPARQL query in ARQ goes through several stages of processing:
       – String to Query (parsing)
       – Translation from Query to a SPARQL algebra expression
       – Optimization of the algebra expression
       – Query plan determination and low-level optimization
       – Evaluation of the query plan
  16. SPARQL algebra features
     • SPARQL Algebra [1]
     [1] http://www.w3.org/TR/sparql11-query/#sparqlQuery
  17. SPARQL algebra features
     Example query:
       PREFIX foaf: <http://xmlns.com/foaf/0.1/>
       SELECT DISTINCT ?name ?nick WHERE {
         ?x foaf:mbox <mailto:person@server.com> .
         ?x foaf:name ?name
         OPTIONAL { ?x foaf:nick ?nick } }
     Algebra tree (figure): distinct, project (?name ?nick), leftjoin, bgp nodes with the triples
       ?x foaf:mbox <mailto:person@server.com>
       ?x foaf:name ?name
       ?x foaf:nick ?nick
     Feature vector columns (figure): triple, bgp, join, leftjoin, ..., project, distinct, ..., depth
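A rough sketch of how such a feature vector could be derived from an algebra tree. The slide's figure is only partly recoverable, so several things here are assumptions: the nested-tuple encoding, the `OPERATORS` list (the slide elides some columns), the grouping of the three triples into two bgp nodes, and the definition of depth as the number of nodes on the longest root-to-leaf path.

```python
from collections import Counter

# Hypothetical operator vocabulary; the slide shows only some of its columns.
OPERATORS = ["triple", "bgp", "join", "leftjoin", "project", "distinct"]

# Algebra tree for the example query, encoded as (operator, *children).
# Strings inside a "triple" node are terms, not child operators.
EXAMPLE_TREE = (
    "distinct",
    ("project",
     ("leftjoin",
      ("bgp",
       ("triple", "?x", "foaf:mbox", "<mailto:person@server.com>"),
       ("triple", "?x", "foaf:name", "?name")),
      ("bgp",
       ("triple", "?x", "foaf:nick", "?nick")))),
)

def count_operators(node, counts=None):
    """Count how often each operator occurs in the tree."""
    if counts is None:
        counts = Counter()
    counts[node[0]] += 1
    for child in node[1:]:
        if isinstance(child, tuple):
            count_operators(child, counts)
    return counts

def tree_depth(node):
    """Number of operator nodes on the longest root-to-leaf path (assumed definition)."""
    child_depths = [tree_depth(c) for c in node[1:] if isinstance(c, tuple)]
    return 1 + max(child_depths, default=0)

def algebra_features(tree):
    """Operator frequencies in OPERATORS order, followed by the tree depth."""
    counts = count_operators(tree)
    return [counts[op] for op in OPERATORS] + [tree_depth(tree)]

print(algebra_features(EXAMPLE_TREE))
```

Operators that do not occur in a query (here `join`) simply get a count of zero, which keeps every query's vector the same length.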
  18. Experiment 1
     • Model: Support Vector Machine regression
     • Evaluation measure: $R^2$
       • Measures how well future samples are likely to be predicted by the model.
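For reference, the usual definition of the coefficient of determination is $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it directly; the slides do not say which implementation was used, so treat the function name `r_squared` as illustrative.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # always predicting the mean
```

A value near 1 means the predictions track the actual execution times closely, a value near 0 means the model does no better than predicting the mean, and the value can be negative for a model that does worse than that.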
  19. Experiment 1
     • Test dataset $R^2$ = 0.004492
     Log-scale plot of predicted vs. actual execution times for the test queries.
  20. Experiment 1
     Some of the long-running queries share structurally similar basic graph patterns.
       { dbpedia:1549_Mikko ?p ?uri . ?uri rdf:type ?x }
     Challenge. How do we represent basic graph patterns as vectors?
  21. Basic Graph Pattern Features
     • Infinite number of possibilities to write a basic graph pattern (BGP)
     • Even restricted to the set of literal values and the set of resources appearing in the RDF graph
       – Exponential number of possibilities
       – A graph with n triples has $2^n$ subsets of triples
     • Feature vector with exponential number of dimensions
       – Not feasible
  22. Basic Graph Pattern Features
     • Pattern graph = RDF graph constructed from all the BGPs in a query
       – Replace variables with a fixed symbol ‘?’
     • Cluster the training queries based on pattern graph similarities
     • Create a vector with similarity scores between the pattern graph of the query and the queries in the cluster centers.
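The variable-replacement step on slide 22 can be sketched as below. The representation is an assumption: each triple pattern is a 3-tuple of strings, a term counts as a variable when it starts with `?` or `$`, and the pattern graph is stored as a set of triples.

```python
def is_variable(term):
    """SPARQL variables start with '?' or '$'."""
    return term.startswith("?") or term.startswith("$")

def pattern_graph(triple_patterns):
    """Replace every variable with the fixed symbol '?' and collect the triples."""
    return frozenset(
        tuple("?" if is_variable(term) else term for term in triple)
        for triple in triple_patterns
    )

# The basic graph pattern from the slides, written as (subject, predicate, object).
bgp = [
    ("dbpedia:1549_Mikko", "?p", "?uri"),
    ("?uri", "rdf:type", "?x"),
]
print(sorted(pattern_graph(bgp)))
```

Because variable names are erased, two queries that differ only in how their variables are named map to the same pattern graph.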
  23. • Graph Edit Distance
       – Minimum amount of distortion needed to transform one graph to another
       – Compute similarity by inverting the distance
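A sketch of turning distances into the similarity features described above. Two parts are assumptions, since the slides do not give them: the formula `1 / (1 + distance)` for "inverting" a distance, and `toy_edit_distance`, which merely counts the triples present in one graph but not the other and is a stand-in for a real graph edit distance.

```python
def toy_edit_distance(graph_a, graph_b):
    """Stand-in distance: triples to delete plus triples to insert."""
    return len(graph_a ^ graph_b)

def similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1] (assumed formula)."""
    return 1.0 / (1.0 + distance)

def bgp_feature_vector(query_graph, center_graphs, distance=toy_edit_distance):
    """One similarity score per cluster center."""
    return [similarity(distance(query_graph, center)) for center in center_graphs]

# Pattern graphs as sets of triples with variables already replaced by '?'.
query = frozenset({("dbpedia:1549_Mikko", "?", "?"), ("?", "rdf:type", "?")})
centers = [
    query,
    frozenset({("dbpedia:Radu_Sabo", "?", "?"), ("?", "rdf:type", "?")}),
]
print(bgp_feature_vector(query, centers))
```

The vector has one dimension per cluster center, so its length is fixed by the number of clusters rather than by the size of the query.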
  24. • Graph Edit Distance
       – Usually computed using A* search
         • Exponential running time
       – Approximate graph edit distance based on bipartite matching
         • Previous research shows very accurate results with classification problems
  25. • Clustering Training Queries
       – K-medoids clustering algorithm with approximated edit distance as distance function
         • Selects data points as cluster centers
         • Arbitrary distance function
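A minimal k-medoids sketch that shows the two properties listed on slide 25: cluster centers are always data points, and any distance function can be plugged in. It uses the simple alternating (assign, then update) scheme with the first k points as initial medoids, which is an assumption; the slides do not say which k-medoids variant or initialization was used. It also assumes the points are distinct.

```python
def assign(points, medoids, distance):
    """Map each medoid index to the indices of the points closest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: distance(p, points[m]))
        clusters[nearest].append(i)
    return clusters

def k_medoids(points, k, distance, max_iter=100):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    medoids = list(range(k))  # naive initialization: the first k points
    for _ in range(max_iter):
        clusters = assign(points, medoids, distance)
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep a medoid that attracted no points
                new_medoids.append(m)
                continue
            # The new medoid is the member with the smallest total distance
            # to all other members of its cluster.
            best = min(
                members,
                key=lambda c: sum(distance(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, distance)

values = [0, 1, 2, 10, 11, 12]
medoids, clusters = k_medoids(values, 2, lambda a, b: abs(a - b))
print([values[m] for m in medoids], clusters)
```

In the setting of the slides, `points` would be pattern graphs and `distance` the approximated graph edit distance; plain integers are used here only to keep the example small.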
  26. Experiment 2
     • Model: Support Vector Machine regression
     • Test dataset $R^2$ = 0.124204
     • K = 10
     Plot panels: Algebra features; Algebra + BGP features
  27. Multiple Regressions
     • We train different SVM regressions for different time ranges.
     • Each regression sees less variance along the y-axis, so a curve is easier to fit.
  28. • Different time ranges
       – Clustering the execution time ranges
         • We use the x-means clustering algorithm, which automatically estimates the number of clusters
       – 5 clusters found in the training dataset
       – Each cluster contains queries with similar execution times
  29. • Predicting execution time range
       – Predict the corresponding clusters for unseen queries.
       – How:
         • Train an SVM classifier with the found clusters as labels
         • Classify unseen queries: accuracy of 96% for the test dataset
         • This means we can accurately predict time ranges
  30. • Predicting execution time
       – Different SVM regressions for different time ranges.
       – For an unseen query, use the regression that corresponds to its predicted time-range cluster
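The classify-then-regress pipeline of slides 27 to 30 boils down to a dispatch step, sketched below. The class name `TwoStagePredictor` is hypothetical, and the classifier and regressors are plain callables standing in for the trained SVM models, so the numbers in the example carry no meaning.

```python
class TwoStagePredictor:
    """Pick a time-range cluster first, then apply that cluster's regression."""

    def __init__(self, classify_range, regressors):
        self.classify_range = classify_range  # feature vector -> cluster label
        self.regressors = regressors          # cluster label -> regression callable

    def predict(self, features):
        label = self.classify_range(features)
        return self.regressors[label](features)

# Toy stand-ins for the trained models (assumptions, not the real SVMs).
def toy_classifier(features):
    return "short" if features[0] < 5 else "long"

toy_regressors = {
    "short": lambda f: 0.1 * f[0],
    "long": lambda f: 100.0 + 20.0 * f[0],
}

model = TwoStagePredictor(toy_classifier, toy_regressors)
print(model.predict([2]), model.predict([8]))
```

Each regression only has to fit the queries of one time range, which is the reduced-variance argument made on slide 27.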
  31. Experiment 3
     • Test dataset $R^2$ = 0.83862
     Plot panels: Algebra + BGP features; Multiple regressions
  32. Predicting with nearest neighbors regression
     • The k-nearest neighbors algorithm (k-NN) is often successful in cases where the decision boundary is irregular.
     • We train a k-NN with
       – Euclidean distance as the distance function
       – Distance weighting: weighted by the inverse of the distance
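A small sketch of k-NN regression with the two choices named on slide 32: Euclidean distance and inverse-distance weighting. It does a brute-force scan over the training set, and returning the stored target when a neighbor has distance zero is an assumption made here to avoid dividing by zero.

```python
import math

def knn_predict(train_x, train_y, query, k=2):
    """Inverse-distance-weighted average of the k nearest training targets."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    # An exact match would get infinite weight, so return its target directly.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, nearest))
    return weighted_sum / sum(weights)

train_x = [(0.0,), (1.0,), (3.0,)]
train_y = [10.0, 20.0, 40.0]
print(knn_predict(train_x, train_y, (0.25,), k=2))
```

For the query at 0.25, the neighbor at 0.0 is three times closer than the one at 1.0, so it gets three times the weight.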
  33. • k-dimensional tree (k-d tree) data structure to search the nearest neighbors
       – a space-partitioning data structure for organizing points in a k-dimensional space
         • Complexity of a search: O(log N) operations
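A compact k-d tree sketch, assuming points are tuples of numbers and using dictionaries as tree nodes (an illustrative choice, not the implementation used for the experiments). The search descends into the half-space containing the target first and visits the other side only when the splitting plane is closer than the best match so far. That pruning is what gives logarithmic search time on average for balanced trees in low dimensions.

```python
import math

def build_kd_tree(points, depth=0):
    """Split on one coordinate per level, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # the median point becomes this node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    if best is None or math.dist(point, target) < math.dist(best, target):
        best = point
    diff = target[axis] - point[axis]
    near, far = (
        (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    )
    best = nearest(near, target, best)
    # Only cross the splitting plane if a closer point could lie beyond it.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(pts)
print(nearest(tree, (9, 2)))
```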
  34. Experiment 4
     • Test dataset $R^2$ = 0.837
     • k = 2 for k-NN (selected by cross validation)
     Plot panels: Multiple regressions; k-NN
  35. • Future work
       – Training data with broad coverage
         • DBpedia SPARQL benchmark query templates
           – Berlin: 5 templates
           – DBPSB: 20 templates
       – Fine tuning with more cross validation
  36. SUGGESTING SPARQL QUERIES
  37. Suggesting SPARQL queries based on query history
     • Use the same features
     • Construct a k-d tree for nearest neighbor search
     • Top M neighbors for a query are the top M suggestions for that query
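The top-M lookup on slide 37 can be sketched as follows. For brevity this version scans the whole history with `heapq.nsmallest` instead of querying a k-d tree, which returns the same neighbors but without the tree's speed-up. The feature vectors and query labels are made-up placeholders.

```python
import heapq
import math

def suggest(history, query_vector, m=3):
    """Return the texts of the m history queries closest to query_vector."""
    ranked = heapq.nsmallest(
        m, history, key=lambda item: math.dist(item[0], query_vector)
    )
    return [text for _, text in ranked]

# Hypothetical history: (feature vector, query text) pairs.
history = [
    ((1.0, 0.0), "query A"),
    ((0.9, 0.1), "query B"),
    ((0.0, 1.0), "query C"),
    ((0.5, 0.5), "query D"),
]
print(suggest(history, (1.0, 0.05), m=2))
```

Since the suggestions reuse the prediction features, queries with similar algebra and basic graph pattern structure end up close together and are suggested for each other.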
  38. Example
       SELECT DISTINCT ?uri
       WHERE { dbpedia:1549_Mikko ?p ?uri . ?uri rdf:type ?x }
     Suggestion 1
       SELECT DISTINCT ?uri
       WHERE { dbpedia:Radu_Sabo ?p ?uri . ?uri rdf:type ?x }
     Suggestion 2
       SELECT DISTINCT ?uri
       WHERE { dbpedia:Hafar_Al-Batin ?p ?uri . ?uri rdf:type ?x }
     Suggestion 3
       SELECT DISTINCT ?uri
       WHERE { dbpedia:Maurice_D._G._Scott ?p ?uri . ?uri rdf:type ?x }
  39. • Future work
       – Query construction and refinement workflow
         • How to use the query suggestions?
       – Evaluating the suggestions
         • User study
  40. Thank you