Predicting SPARQL query execution time and suggesting SPARQL queries based on query history

Rakebul Hasan

Context

•  Assisting human users and software agents in:
–  Querying Semantic Web data
•  Understanding query behavior: predicting query performance
–  Workload management, query scheduling, query optimization
•  Constructing and refining queries: suggesting alternatives
–  Consuming Semantic Web data
•  Understanding reasoning of Semantic Web software agents: explaining reasoning
–  Transparency, trust, scrutability, decision effectiveness, decision efficiency, user satisfaction

Outline

•  Predicting SPARQL query execution time
•  Suggesting similar SPARQL queries from query history

PREDICTING SPARQL QUERY EXECUTION TIME

•  Accurately predicting query performance enables effective
–  workload management
–  query scheduling
–  query optimization

Understanding performance of computer programs

Insight. [Knuth] Use the scientific method to understand performance.

Scientific method applied to analysis of algorithms

•  A framework for predicting performance and comparing algorithms.
•  Scientific method
–  Observe some feature of the natural world.
–  Hypothesize a model that is consistent with the observations.
–  Predict events using the hypothesis.
–  Verify the predictions by making further observations.
–  Validate by repeating until the hypothesis and observations agree.
•  Principles
–  Experiments must be reproducible.
–  Hypotheses must be falsifiable.
•  Feature of the natural world. Computer itself.

Slide credit: Robert Sedgewick

Example: 3-Sum

•  3-SUM. Given N distinct integers, how many triples sum to exactly zero?
•  3-SUM brute-force algorithm. Check all the possible triples.
•  How much time does it take?

Slide credit: Robert Sedgewick
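The brute-force approach can be sketched in a few lines of Python. This is a minimal illustration, not the slide author's code: the function name `count_zero_triples` and the sample input are made up for the example. It enumerates every triple, so the work grows roughly with N^3.

```python
from itertools import combinations


def count_zero_triples(numbers):
    """Count the triples (taken from distinct positions) that sum to zero."""
    # combinations() yields each unordered triple exactly once,
    # i.e. N * (N - 1) * (N - 2) / 6 candidates in total.
    return sum(1 for a, b, c in combinations(numbers, 3) if a + b + c == 0)


# Made-up sample input, for illustration only.
sample = [30, -40, -20, -10, 40, 0, 10, 5]
print(count_zero_triples(sample))
```

Timing this function for doubling values of N gives the kind of measurements the following slides analyze.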

Data analysis

•  Standard plot. Plot running time $T(N)$ vs. input size $N$.

Slide credit: Robert Sedgewick

Data analysis

•  Log-log plot. Plot running time $\lg T(N)$ vs. input size $\lg N$.
•  Regression. Fit straight line through data points: $a N^b$.
•  Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$ seconds.

Slide credit: Robert Sedgewick
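The regression step can be sketched as an ordinary least-squares fit in log-log space. The sketch below assumes the model $T(N) = a N^b$, so that $\lg T = \lg a + b \lg N$ is a straight line; the function name `fit_power_law` and the timings are synthetic, generated from an exact cubic purely for illustration.

```python
import math


def fit_power_law(sizes, times):
    """Fit T = a * N**b by least squares on (lg N, lg T); return (a, b)."""
    xs = [math.log2(n) for n in sizes]
    ys = [math.log2(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # The slope of the fitted line is the exponent b.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # The intercept is lg a, so a = 2 ** intercept.
    intercept = mean_y - slope * mean_x
    return 2 ** intercept, slope


# Synthetic measurements (not real timings): an exact cubic.
sizes = [1000, 2000, 4000, 8000]
times = [1e-10 * n ** 3 for n in sizes]
a, b = fit_power_law(sizes, times)
print(a, b)
```

With real measurements the points scatter around the line, and the fitted exponent lands near, rather than exactly on, an integer.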

Prediction and validation

•  Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$ seconds.
•  Predictions.
–  51.0 seconds for N = 8000.
–  408.1 seconds for N = 16000.
•  Observations.

Validates the hypothesis

Slide credit: Robert Sedgewick
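The two predictions follow directly from plugging N into the hypothesized formula. A small check, with the constants taken from the slide and a function name chosen only for this example:

```python
def predicted_seconds(n, a=1.006e-10, b=2.999):
    """Running time predicted by the hypothesis T(N) = a * N**b."""
    return a * n ** b


for n in (8000, 16000):
    print(n, round(predicted_seconds(n), 1))
```

Doubling N multiplies the prediction by about 2^2.999, or roughly 8, which is why 51.0 seconds becomes about 408.1 seconds.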

Understanding performance of database queries

•  Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning.
•  Gupta et al. use machine learning for predicting query execution time ranges.

Ganapathi et al.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Proc. of the 2009 IEEE ICDE
Gupta et al.: PQR: Predicting query execution times for autonomous workload management. In Proc. of the 2008 ICAC

Predicting SPARQL query execution time

•  Key challenge. Feature engineering
–  Representing SPARQL queries as feature vectors
•  Each dimension of the vector is a feature

Configuration

•  Apache Jena TDB
–  With DBpedia 3.8 dataset
•  Training, validation, and test queries: randomly selected from DBpedia SPARQL Benchmark (DBPSB) query dataset
–  3600 training, 1200 validation, 1200 test

Jena ARQ query processing

•  A SPARQL query in ARQ goes through several stages of processing:
–  String to Query (parsing)
–  Translation from Query to a SPARQL algebra expression
–  Optimization of the algebra expression
–  Query plan determination and low-level optimization
–  Evaluation of the query plan

SPARQL algebra features

•  SPARQL Algebra¹

¹ http://www.w3.org/TR/sparql11-query/#sparqlQuery

SPARQL algebra features

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?nick WHERE {
   ?x foaf:mbox <mailto:person@server.com> .
   ?x foaf:name ?name
   OPTIONAL { ?x foaf:nick ?nick }
}

Algebra expression tree:

distinct
  project (?name ?nick)
    leftjoin
      bgp
        triple ?x foaf:mbox <mailto:person@server.com>
        triple ?x foaf:name ?name
      bgp
        triple ?x foaf:nick ?nick

Feature vector:

triple  bgp  join  leftjoin  . . .  project  distinct  . . .  depth
3       2    0     1         . . .  1        1         . . .  4
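One way to obtain such operator counts is to scan a textual rendering of the algebra expression. The sketch below is an assumption-laden illustration, not the author's extraction code: it takes an s-expression string written out by hand for the foaf example, uses a hypothetical, abbreviated operator list (`OPERATORS`), and leaves out the depth feature.

```python
import re
from collections import Counter

# Hypothetical, abbreviated operator vocabulary; the slide elides most dimensions.
OPERATORS = ["triple", "bgp", "join", "leftjoin", "project", "distinct"]


def algebra_features(sse):
    """Count how often each operator opens a sub-expression in `sse`."""
    # An operator is the alphabetic symbol directly after an opening parenthesis;
    # a variable list such as "(?name ?nick)" therefore does not match.
    heads = re.findall(r"\(\s*([A-Za-z]+)", sse)
    counts = Counter(heads)
    return [counts[op] for op in OPERATORS]


# Hand-written s-expression for the example query (an assumed rendering).
EXAMPLE = """
(distinct
  (project (?name ?nick)
    (leftjoin
      (bgp
        (triple ?x foaf:mbox <mailto:person@server.com>)
        (triple ?x foaf:name ?name))
      (bgp (triple ?x foaf:nick ?nick)))))
"""
print(algebra_features(EXAMPLE))
```

A fixed operator order matters here: every query must map to a vector with the same dimensions so that a regression model can compare them.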

Experiment 1

•  Model: Support Vector Machine regression
•  Evaluation measure: $R^2$ (coefficient of determination)
•  Measures how well future samples are likely to be predicted by the model.
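For reference, the coefficient of determination compares the model's squared error with that of always predicting the mean: $R^2 = 1 - SS_{res} / SS_{tot}$. A minimal sketch of that standard definition follows; the function name and the toy numbers are illustrative, not from the experiments.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # no better than the mean
```

A value near 1 means the predictions track the actual values closely; a value near 0 means the model does no better than predicting the mean execution time.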

Experiment 1

•  Test dataset $R^2$ = 0.004492

Log-scale plot of predicted vs. actual execution times for the test queries.

Experiment 1

Some of the long running queries share structurally similar basic graph patterns.

{
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}

Challenge. How do we represent basic graph patterns as vectors?

Basic Graph Pattern Features

•  Infinite number of possibilities to write a basic graph pattern (BGP)
•  Using only the set of literal values and the set of resources appearing in the RDF graph
–  Exponential number of possibilities
–  A graph with n triples has $2^n$ subsets of triples
•  Feature vector with exponential number of dimensions
–  Not feasible

Basic Graph Pattern Features

•  Pattern graph = RDF graph constructed from all the BGPs in a query
–  Replace variables with a fixed symbol ‘?’
•  Cluster the training queries based on pattern graph similarities
•  Create a vector with similarity scores between the pattern graph of the query and the queries in the cluster centers.
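The pattern-graph construction can be sketched as follows. This is an illustrative reading of the slide, with assumptions: triple patterns are plain tuples of strings, a variable is any term starting with "?", and the graph is represented as a set of triples. The function name `pattern_graph` is hypothetical.

```python
def pattern_graph(bgps):
    """Merge the triple patterns of all BGPs and blank out variable names."""
    def normalise(term):
        # Every variable collapses to the same fixed symbol '?'.
        return "?" if term.startswith("?") else term

    return {
        tuple(normalise(term) for term in triple)
        for bgp in bgps
        for triple in bgp
    }


# The basic graph pattern shown earlier for the long running queries.
query_bgps = [[
    ("dbpedia:1549_Mikko", "?p", "?uri"),
    ("?uri", "rdf:type", "?x"),
]]
print(pattern_graph(query_bgps))
```

Because variable names are dropped, two queries that differ only in how their variables are named produce the same pattern graph.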

•  Graph Edit Distance
–  Minimum amount of distortion needed to transform one graph to another
–  Compute similarity by inverting the distance
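A sketch of how distances could become the BGP part of a feature vector. Two things here are assumptions rather than details from the slides: the inversion formula 1 / (1 + d), which is one common choice that maps distance 0 to similarity 1, and `toy_distance`, a simple stand-in that is not a graph edit distance.

```python
def similarity(distance):
    """Turn a distance into a similarity in (0, 1]; assumed form 1 / (1 + d)."""
    return 1.0 / (1.0 + distance)


def toy_distance(graph_a, graph_b):
    """Stand-in distance: triples present in exactly one of the two graphs."""
    return len(graph_a ^ graph_b)


def bgp_features(query_graph, center_graphs, distance=toy_distance):
    """One similarity score per cluster center, in a fixed order."""
    return [similarity(distance(query_graph, center)) for center in center_graphs]


# Made-up pattern graphs, represented as sets of triples.
query = {("?", "rdf:type", "?")}
centers = [
    {("?", "rdf:type", "?")},
    {("?", "rdf:type", "?"), ("?", "foaf:name", "?")},
    set(),
]
print(bgp_features(query, centers))
```

With K cluster centers the query gets K extra dimensions, which keeps the vector length fixed regardless of how many distinct patterns the queries contain.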

•  Graph Edit Distance
–  Usually computed using A* search
•  Exponential running time
–  Bipartite matching based approximated graph edit distance with
•  Previous research shows very accurate results with classification problems

•  Clustering Training Queries
–  K-medoids clustering algorithm with approximated edit distance as distance function
•  Selects data points as cluster centers
•  Arbitrary distance function
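The two properties listed above (centers are actual data points, and any distance function works) show up clearly in a small k-medoids sketch. This is a simplified alternating variant with a naive initialisation, written for illustration; it assumes distinct points and is not the implementation used in the experiments.

```python
def k_medoids(points, k, dist, max_iter=100):
    """Cluster `points` around k medoids using an arbitrary distance function."""
    medoids = list(range(k))  # naive initialisation: indices of the first k points
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, point in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(point, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break  # converged: the medoids did not change
        medoids = new_medoids
    return [points[m] for m in medoids]


# Toy one-dimensional data with absolute difference as the distance.
centers = k_medoids([1, 2, 3, 10, 11, 12], 2, lambda a, b: abs(a - b))
print(sorted(centers))
```

Since only pairwise distances are needed, the same loop works when `points` are pattern graphs and `dist` is an approximated graph edit distance, a setting where a mean (as in k-means) would not be defined.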

Experiment 2

•  Model: Support Vector Machine regression
•  Test dataset $R^2$ = 0.124204
•  K = 10

[Figure panels: Algebra features; Algebra + BGP features]

Multiple Regressions

•  We train different SVM regressions for different time ranges.
•  The variance along the y-axis is smaller within each regression, so it is easier to fit a curve.

•  Different time ranges
–  Clustering the execution time ranges
•  We use the x-means clustering algorithm, which automatically estimates the number of clusters
–  5 clusters found in the training dataset
–  Each cluster contains queries with similar execution times

•  Predicting execution time range
–  Predict the corresponding clusters for unseen queries.
–  How?
•  Train an SVM classifier with the found clusters as labels
•  Classify unseen queries: accuracy of 96% for the test dataset
•  This means we can accurately predict time ranges

•  Predicting execution time
–  Different SVM regressions for different time ranges.
–  For an unseen query, use the regression that corresponds to its predicted time range cluster
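The two-stage prediction amounts to a routing step: classify the query into a time range, then apply the regression trained for that range. The sketch below shows only this routing. The classifier and regressors are hypothetical stand-ins (simple functions with made-up coefficients), not trained SVM models.

```python
def predict_time(features, classify_range, regressors):
    """Route a query to the regressor trained for its predicted time range."""
    cluster = classify_range(features)    # stage 1: predict the time range
    return regressors[cluster](features)  # stage 2: range-specific regression


# Hypothetical stand-in for the trained time range classifier.
def classify_range(features):
    return "short" if features[0] < 5 else "long"


# Hypothetical stand-ins for the per-range regressions.
regressors = {
    "short": lambda f: 0.01 * f[0],
    "long": lambda f: 2.0 + 0.5 * f[0],
}

print(predict_time([2], classify_range, regressors))
print(predict_time([10], classify_range, regressors))
```

The design depends on the classifier being accurate: a query routed to the wrong range is scored by a regression fitted on very different execution times.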

Experiment 3

•  Test dataset $R^2$ = 0.83862

[Figure panels: Algebra + BGP features; Multiple regressions]

Predicting with nearest neighbors regression

•  The k-nearest neighbors algorithm (k-NN) is often successful in the cases where the decision boundary is irregular.
•  We train a k-NN with
–  Euclidean distance as the distance function
–  Distance weighting: weighted by the inverse of the distance
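Inverse-distance weighting means closer neighbors contribute more to the predicted value. A minimal sketch under stated assumptions: it uses a brute-force scan rather than a tree index, and the function name and training pairs are made up for illustration.

```python
import math


def knn_predict(train, query, k=2):
    """Distance-weighted k-NN regression over (feature_vector, target) pairs."""
    # Brute-force search: sort all training rows by Euclidean distance.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    weighted_sum = 0.0
    weight_total = 0.0
    for features, target in nearest:
        d = math.dist(features, query)
        if d == 0.0:
            return target  # exact match: its target dominates
        weighted_sum += target / d  # weight = 1 / distance
        weight_total += 1.0 / d
    return weighted_sum / weight_total


# Made-up training data: (feature vector, execution time).
train = [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0), ((10.0, 10.0), 100.0)]
print(knn_predict(train, (0.25, 0.0), k=2))
```

In the example the query is three times closer to the first point than to the second, so the prediction is pulled toward 1.0 rather than sitting at the midpoint 2.0.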

•  k-dimensional tree (k-d tree) data structure to search the nearest neighbors
–  a space-partitioning data structure for organizing points in a k-dimensional space
•  Complexity of a search: $O(\log N)$ operations on average

Experiment 4

•  Test dataset $R^2$ = 0.837
•  k=2 for k-NN (selected by cross validation)

[Figure panels: Multiple regressions; k-NN]

•  Future work
–  Training data with broad coverage
•  DBpedia SPARQL benchmark query templates
–  Berlin: 5 templates
–  DBPSB: 20 templates
–  Fine tuning with more cross validation

SUGGESTING SPARQL QUERIES

Suggesting SPARQL queries based on query history

•  Use the same features
•  Construct a k-d tree for nearest neighbor search
•  Top M neighbors for a query are the top M suggestions for that query
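The suggestion step reduces to a top-M nearest neighbor lookup over the feature vectors of past queries. The sketch below uses a brute-force scan with `heapq.nsmallest` in place of a k-d tree, which returns the same neighbors but without the index. The function name, the history entries, and the feature values are made up for illustration.

```python
import heapq
import math


def suggest(history, query_features, m=3):
    """Return the texts of the M history queries closest to `query_features`."""
    nearest = heapq.nsmallest(
        m, history, key=lambda item: math.dist(item[1], query_features)
    )
    return [text for text, _ in nearest]


# Made-up query history: (query text, feature vector).
history = [
    ("Q1", (0.0, 0.0)),
    ("Q2", (1.0, 1.0)),
    ("Q3", (5.0, 5.0)),
    ("Q4", (0.5, 0.0)),
]
print(suggest(history, (0.1, 0.0), m=2))
```

Because the same feature vectors serve both tasks, the history indexed for execution time prediction can be reused for suggestions without extra feature extraction.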

Example

SELECT DISTINCT ?uri
WHERE {
  dbpedia:1549_Mikko ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 1

SELECT DISTINCT ?uri
WHERE {
  dbpedia:Radu_Sabo ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 2

SELECT DISTINCT ?uri
WHERE {
  dbpedia:Hafar_Al-Batin ?p ?uri .
  ?uri rdf:type ?x
}

Suggestion 3

SELECT DISTINCT ?uri
WHERE {
  dbpedia:Maurice_D._G._Scott ?p ?uri .
  ?uri rdf:type ?x
}

•  Future work
–  Query construction and refinement workflow
•  How to use the query suggestions?
–  Evaluating the suggestions
•  User study
