Predicting SPARQL Query Execution Time and Suggesting SPARQL Queries Based on Query History

Rakebul Hasan
  
Context
•  Assisting human users and software agents in:
   –  Querying Semantic Web data
      •  Understanding query behavior: predicting query performance
         –  Workload management, query scheduling, query optimization
      •  Constructing and refining queries: suggesting alternatives
   –  Consuming Semantic Web data
      •  Understanding reasoning of Semantic Web software agents: explaining reasoning
         –  Transparency, trust, scrutability, decision effectiveness, decision efficiency, user satisfaction
  
Outline
•  Predicting SPARQL query execution time
•  Suggesting similar SPARQL queries from query history
  
PREDICTING SPARQL QUERY EXECUTION TIME

•  Accurately predicting query performance enables effective
   –  workload management
   –  query scheduling
   –  query optimization
  
Understanding performance of computer programs
Insight. [Knuth] Use the scientific method to understand performance.
  
Scientific method applied to analysis of algorithms
•  A framework for predicting performance and comparing algorithms.
•  Scientific method
   –  Observe some feature of the natural world.
   –  Hypothesize a model that is consistent with the observations.
   –  Predict events using the hypothesis.
   –  Verify the predictions by making further observations.
   –  Validate by repeating until the hypothesis and observations agree.
•  Principles
   –  Experiments must be reproducible.
   –  Hypotheses must be falsifiable.
•  Feature of the natural world. Computer itself.
Slide credit: Robert Sedgewick
  
Example: 3-Sum
•  3-SUM. Given N distinct integers, how many triples sum to exactly zero?
•  3-SUM brute-force algorithm. Check all the possible triples.
•  How much time does it take?
Slide credit: Robert Sedgewick
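The brute-force approach described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the original lecture; the function name `count_three_sum` and the sample input are assumptions made for the example.

```python
from itertools import combinations

def count_three_sum(values):
    """Count the triples (by position) whose values sum to exactly zero."""
    # Brute force: examine every one of the N-choose-3 triples,
    # which is roughly N^3 / 6 checks.
    return sum(1 for a, b, c in combinations(values, 3) if a + b + c == 0)

# Illustrative input with eight distinct integers.
sample = [30, -40, -20, -10, 40, 0, 10, 5]
print(count_three_sum(sample))  # prints 4
```

Because every triple is inspected, the number of checks grows cubically with N, which is the growth rate the following slides estimate empirically.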
  
Data analysis
•  Standard plot. Plot running time $T(N)$ vs. input size $N$.
Slide credit: Robert Sedgewick
  
Data analysis
•  Log-log plot. Plot running time $\lg T(N)$ vs. input size $\lg N$.
•  Regression. Fit straight line through data points: $a N^{b}$.
•  Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$.
Slide credit: Robert Sedgewick
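A straight-line fit in log-log space recovers the two parameters of a power law: the slope is the exponent b and the intercept is lg a. The sketch below shows one way to do this with ordinary least squares in plain Python. The function name `fit_power_law` and the synthetic timings are assumptions for illustration; they are not the measurements behind the slide.

```python
import math

def fit_power_law(sizes, times):
    """Fit T(N) = a * N**b by least squares on the points (lg N, lg T)."""
    xs = [math.log2(n) for n in sizes]
    ys = [math.log2(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # The slope of the fitted line in log-log space is the exponent b.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # The intercept of the fitted line is lg a.
    a = 2 ** (mean_y - b * mean_x)
    return a, b

# Hypothetical timings generated from an exact cubic law T = 1e-10 * N^3.
sizes = [1000, 2000, 4000, 8000]
times = [1e-10 * n ** 3 for n in sizes]
a, b = fit_power_law(sizes, times)
print(a, b)
```

With real measurements the points scatter around the line, so the fitted exponent comes out close to, but not exactly, an integer (2.999 on the slide).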
  
Prediction and validation
•  Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$.
•  Predictions.
   –  51.0 seconds for N = 8000.
   –  408.1 seconds for N = 16000.
•  Observations.
Validates the hypothesis.
Slide credit: Robert Sedgewick
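The two predictions follow directly from evaluating the hypothesized power law. A minimal check, where the function name `predicted_seconds` is an assumption and the constants are the ones stated in the hypothesis:

```python
def predicted_seconds(n, a=1.006e-10, b=2.999):
    """Evaluate the fitted power law T(N) = a * N**b."""
    return a * n ** b

for n in (8000, 16000):
    print(n, round(predicted_seconds(n), 1))
# prints:
# 8000 51.0
# 16000 408.1
```

Doubling N multiplies the predicted time by about 2^2.999, which is close to 8, as expected for a cubic algorithm.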
  
Understanding performance of database queries
•  Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning.
•  Gupta et al. use machine learning for predicting query execution time ranges.

Ganapathi et al.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Proc. of the 2009 IEEE ICDE
Gupta et al.: PQR: Predicting query execution times for autonomous workload management. In Proc. of the 2008 ICAC
  
Predicting SPARQL query execution time
•  Key challenge. Feature engineering
   –  Representing SPARQL queries as feature vectors
      •  Each dimension of the vector is a feature
  
Configuration
•  Apache Jena TDB
   –  With DBpedia 3.8 dataset
•  Training, validation, and test queries: randomly selected from DBpedia SPARQL Benchmark (DBPSB) query dataset
   –  3600 training, 1200 validation, 1200 test
  
Jena ARQ query processing
•  A SPARQL query in ARQ goes through several stages of processing:
   –  String to Query (parsing)
   –  Translation from Query to a SPARQL algebra expression
   –  Optimization of the algebra expression
   –  Query plan determination and low-level optimization
   –  Evaluation of the query plan
  
SPARQL algebra features
•  SPARQL Algebra [1]

[1] http://www.w3.org/TR/sparql11-query/#sparqlQuery
  
SPARQL algebra features

Query:
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT DISTINCT ?name ?nick WHERE {
       ?x foaf:mbox <mailto:person@server.com> .
       ?x foaf:name ?name
       OPTIONAL { ?x foaf:nick ?nick }
    }

Algebra expression tree:
    distinct
      project (?name ?nick)
        leftjoin
          bgp
            triple ?x foaf:mbox <mailto:person@server.com>
            triple ?x foaf:name ?name
          bgp
            triple ?x foaf:nick ?nick

Feature vector (the depth value is not legible in the source):
    triple  bgp  join  leftjoin  . . . .  project  distinct  . . . .  depth
    3       2    0     1         . . . .  1        1         . . . .
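Algebra features of this kind can be obtained by walking the algebra expression tree and counting how often each operator occurs. The sketch below assumes a hypothetical nested-tuple encoding of the tree for a query with two required triple patterns and one OPTIONAL pattern; it does not use Jena ARQ's API, it omits the projected variables, and the depth convention (number of edges on the longest root-to-leaf path) is an assumption.

```python
from collections import Counter

# Hypothetical encoding: (operator, *children); triple patterns are leaves
# written as ("triple", subject, predicate, object).
ALGEBRA = (
    "distinct",
    ("project",
     ("leftjoin",
      ("bgp",
       ("triple", "?x", "foaf:mbox", "<mailto:person@server.com>"),
       ("triple", "?x", "foaf:name", "?name")),
      ("bgp",
       ("triple", "?x", "foaf:nick", "?nick")))),
)

# Fixed operator order so every query maps to the same vector layout.
OPERATORS = ["triple", "bgp", "join", "leftjoin", "project", "distinct"]

def children(node):
    # Triples are leaves; for other operators the remaining items are sub-expressions.
    return () if node[0] == "triple" else node[1:]

def count_operators(node, counts=None):
    """Count occurrences of each operator name in the tree."""
    counts = Counter() if counts is None else counts
    counts[node[0]] += 1
    for child in children(node):
        count_operators(child, counts)
    return counts

def depth(node):
    """Edges on the longest root-to-leaf path (assumed convention)."""
    return max((1 + depth(child) for child in children(node)), default=0)

def algebra_features(node):
    counts = count_operators(node)
    return [counts[op] for op in OPERATORS] + [depth(node)]

print(algebra_features(ALGEBRA))  # prints [3, 2, 0, 1, 1, 1, 4]
```

A full feature vector would list every operator of the SPARQL algebra in a fixed order; the six shown here are only the ones named on the slide.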
  
Experiment 1
•  Model: Support Vector Machine regression
•  Evaluation measure: $R^2$
   •  Measures how well future samples are likely to be predicted by the model.
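Assuming the evaluation measure is the usual coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$, it can be computed as in the sketch below. The function name `r_squared` and the sample values are illustrative assumptions.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
print(r_squared(actual, actual))                # perfect predictions
print(r_squared(actual, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```

A score of 1 means perfect predictions, and a score near 0 means the model does no better than predicting the mean execution time for every query.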
  
Experiment 1
•  Test dataset $R^2$ = 0.004492
Log scale plot of predicted vs. actual execution times for the test queries.
  
Experiment 1
Some of the long-running queries share structurally similar basic graph patterns.
{
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}
Challenge. How do we represent basic graph patterns as vectors?
  
Basic Graph Pattern Features
•  Infinite number of possibilities to write a basic graph pattern (BGP)
•  Considering only the set of literal values and the set of resources appearing in the RDF graph
   –  Exponential number of possibilities
   –  A graph with n triples has $2^n$ subsets of triples
•  Feature vector with exponential number of dimensions
   –  Not feasible
  
Basic Graph Pattern Features
•  Pattern graph = RDF graph constructed from all the BGPs in a query
   –  Replace variables with a fixed symbol ‘?’
•  Cluster the training queries based on pattern graph similarities
•  Create a vector with similarity scores between the pattern graph of the query and the queries at the cluster centers.
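The pattern graph construction can be sketched as follows, assuming triple patterns are given as plain string tuples. The function name `pattern_graph` and this representation are assumptions for illustration, not the data structures used in the experiments.

```python
def pattern_graph(bgps):
    """Union of all triple patterns in a query, with every variable masked as '?'."""
    def mask(term):
        # SPARQL variables start with '?' or '$'.
        return "?" if term.startswith(("?", "$")) else term
    return {tuple(mask(term) for term in triple) for bgp in bgps for triple in bgp}

# The BGP of the long-running example query shown earlier.
query_bgps = [[("dbpedia:1549_Mikko", "?p", "?uri"), ("?uri", "rdf:type", "?x")]]
graph = pattern_graph(query_bgps)
print(sorted(graph))
```

Masking the variable names means that two queries differing only in how their variables are named yield the same pattern graph.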
  
•  Graph Edit Distance
   –  Minimum amount of distortion needed to transform one graph into another
   –  Compute similarity by inverting the distance

•  Graph Edit Distance
   –  Usually computed using A* search
      •  Exponential running time
   –  Bipartite matching based approximated graph edit distance with
      •  Previous research shows very accurate results with classification problems

•  Clustering Training Queries
   –  K-medoids clustering algorithm with approximated edit distance as distance function
      •  Selects data points as cluster centers
      •  Arbitrary distance function
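The two properties listed here (centers are actual data points, and any distance function can be plugged in) are visible in a minimal k-medoids sketch. This is the simple alternating variant with a naive initialisation, which may differ from the exact variant used in the experiments; the function name `k_medoids` and the toy data are assumptions.

```python
def k_medoids(points, k, distance, max_iter=100):
    """Alternate between assigning points to the nearest medoid and re-picking medoids."""
    medoids = list(points[:k])  # naive initialisation: the first k points
    for _ in range(max_iter):
        clusters = [[] for _ in medoids]
        for p in points:
            nearest = min(range(len(medoids)), key=lambda i: distance(p, medoids[i]))
            clusters[nearest].append(p)
        # New medoid: the cluster member with the smallest total distance to the others.
        new_medoids = [
            min(cluster, key=lambda c: sum(distance(c, q) for q in cluster)) if cluster else m
            for m, cluster in zip(medoids, clusters)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

# Toy data: numbers with absolute difference as the distance function.
medoids, clusters = k_medoids([1, 2, 3, 10, 11, 12], 2, lambda a, b: abs(a - b))
print(medoids, clusters)  # prints [2, 11] [[1, 2, 3], [10, 11, 12]]
```

Since only pairwise distances are needed, the same routine works with pattern graphs as points and an approximated graph edit distance as `distance`.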
  
Experiment 2
•  Model: Support Vector Machine regression
•  Test dataset $R^2$ = 0.124204
•  K = 10
Plot panels: Algebra features | Algebra + BGP features
  
Multiple Regressions
•  We train different SVM regressions for different time ranges.
•  The variance along the y-axis is smaller for each regression, so it is easier to fit a curve.

•  Different time ranges
   –  Clustering the execution time ranges
      •  We use the x-means clustering algorithm, which automatically estimates the number of clusters
   –  5 clusters found in the training dataset
   –  Each cluster contains queries with similar execution times
  

•  Predicting execution time range
   –  Predict the corresponding clusters for unseen queries.
   –  How
      •  Train an SVM classifier with the found clusters as labels
      •  Classify unseen queries: accuracy of 96% for the test dataset
      •  This means we can accurately predict time ranges

•  Predicting execution time
   –  Different SVM regressions for different time ranges.
   –  For an unseen query, use the regression that corresponds to its predicted time range cluster
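The two-stage structure (first predict the time-range cluster, then apply that cluster's regression) can be sketched independently of the learning algorithm. To stay self-contained, the sketch below swaps in trivial stand-ins: a nearest-centroid rule in place of the SVM classifier and a per-cluster mean in place of each SVM regression. All names and data are hypothetical.

```python
from statistics import mean

def train_two_stage(features, times, labels):
    """Group training queries by time-range cluster and fit one model per cluster."""
    by_cluster = {}
    for x, t, c in zip(features, times, labels):
        by_cluster.setdefault(c, []).append((x, t))
    # Stand-in classifier: the centroid of each cluster in feature space.
    centroids = {c: [mean(col) for col in zip(*(x for x, _ in rows))]
                 for c, rows in by_cluster.items()}
    # Stand-in regressor: the mean execution time of each cluster.
    regressors = {c: mean(t for _, t in rows) for c, rows in by_cluster.items()}
    return centroids, regressors

def predict_time(x, centroids, regressors):
    # Stage 1: predict the time-range cluster of the unseen query.
    cluster = min(centroids,
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    # Stage 2: apply the regression that belongs to that cluster.
    return regressors[cluster]

# Hypothetical training data: two feature dimensions, two time-range clusters.
features = [[0, 0], [0, 1], [10, 10], [10, 11]]
times = [1.0, 3.0, 100.0, 120.0]
labels = [0, 0, 1, 1]
centroids, regressors = train_two_stage(features, times, labels)
print(predict_time([1, 1], centroids, regressors))  # prints 2.0
print(predict_time([9, 9], centroids, regressors))  # prints 110.0
```

Each per-cluster model only has to cover a narrow range of execution times, which is the reason the slides give for the easier fit.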
  
Experiment 3
•  Test dataset $R^2$ = 0.83862
Plot panels: Algebra + BGP features | Multiple regressions
  
Predicting with nearest neighbors regression
•  The k-nearest neighbors algorithm (k-NN) is often successful in cases where the decision boundary is irregular.
•  We train a k-NN with
   –  Euclidean distance as the distance function
   –  Distance weighting: weighted by the inverse of the distance
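A minimal sketch of k-NN regression with Euclidean distance and inverse-distance weighting follows. It uses a linear scan for clarity. The function name `knn_predict`, the handling of exact matches, and the sample data are assumptions.

```python
import math

def knn_predict(train_x, train_y, query, k=2):
    """Average the targets of the k nearest points, each weighted by 1 / distance."""
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))[:k]
    # An exact match has distance 0; return its target instead of dividing by zero.
    for d, y in nearest:
        if d == 0:
            return y
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Hypothetical one-dimensional training data.
train_x = [(0.0,), (4.0,), (10.0,)]
train_y = [1.0, 5.0, 100.0]
print(knn_predict(train_x, train_y, (1.0,), k=2))
```

With the query at 1.0, the two nearest neighbors are at distances 1 and 3, so their targets 1.0 and 5.0 are combined with weights 1 and 1/3, which gives a prediction of about 2.0.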
  

•  k-dimensional tree (k-d tree) data structure to search for the nearest neighbors
   –  a space-partitioning data structure for organizing points in a k-dimensional space
•  Complexity of a search: O(log N) operations on average
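A compact k-d tree with a single-nearest-neighbor search is sketched below. The dictionary-based node layout and the function names are assumptions chosen for brevity; a library implementation would normally be used in practice.

```python
import math

def build_kd_tree(points, depth=0):
    """Recursively split on the median along one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return the stored point closest to query (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(query, node["point"]) < math.dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < math.dist(query, best):
        best = nearest(far, query, best)
    return best

# Hypothetical two-dimensional points.
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(points)
print(nearest(tree, (9, 2)))  # prints (8, 1)
```

The pruning test in `nearest` is what lets a search skip most of the tree. The logarithmic search cost holds on average for low-dimensional data and degrades as the number of dimensions grows.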
  
Experiment 4
•  Test dataset $R^2$ = 0.837
•  k=2 for k-NN (selected by cross validation)
Plot panels: Multiple regressions | k-NN

•  Future work
   –  Training data with broad coverage
      •  DBpedia SPARQL benchmark query templates
         –  Berlin: 5 templates
         –  DBPSB: 20 templates
   –  Fine tuning with more cross validation
  
SUGGESTING SPARQL QUERIES
  
Suggesting SPARQL queries based on query history
•  Use the same features
•  Construct a k-d tree for nearest neighbor search
•  Top M neighbors for a query are the top M suggestions for that query
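The suggestion step amounts to a top-M nearest-neighbor lookup over the feature vectors of past queries. The sketch below uses a linear scan with `heapq.nsmallest` in place of a k-d tree to keep it short. The function name `suggest`, the query labels, and the feature vectors are all hypothetical.

```python
import heapq
import math

def suggest(query_vector, history, m=3):
    """Return the m past queries whose feature vectors are nearest to query_vector."""
    ranked = heapq.nsmallest(m, history,
                             key=lambda item: math.dist(item[1], query_vector))
    return [text for text, _ in ranked]

# Hypothetical history of (query text, feature vector) pairs.
history = [
    ("q_a", (0.0, 0.0)),
    ("q_b", (1.0, 0.0)),
    ("q_c", (5.0, 5.0)),
    ("q_d", (0.0, 2.0)),
]
print(suggest((0.1, 0.0), history, m=2))  # prints ['q_a', 'q_b']
```

Swapping the linear scan for a k-d tree changes only how the neighbors are found, not which queries are suggested.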
  
Example
SELECT DISTINCT ?uri
WHERE
{
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 1
SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Radu_Sabo ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 2
SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Hafar_Al-Batin ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 3
SELECT DISTINCT ?uri
WHERE
{
  dbpedia:Maurice_D._G._Scott ?p ?uri .
  ?uri rdf:type ?x
}
  

•  Future work
   –  Query construction and refinement workflow
      •  How to use the query suggestions?
   –  Evaluating the suggestions
      •  User study

Thank you
  
