The Impact of Data Caching on Query Execution for Linked Data

Olaf Hartig
http://olafhartig.de/foaf.rdf#olaf
@olafhartig

Database and Information Systems Research Group
Humboldt-Universität zu Berlin

Can we query the Web of Data
as if it were a single,
giant database?

SELECT DISTINCT ?i ?label
WHERE {
  ?prof rdf:type <http://res ... data/dbprofs#DBProfessor> ;
        foaf:topic_interest ?i .
  OPTIONAL {
    ?i rdfs:label ?label
    FILTER( LANG(?label)="en" || LANG(?label)="" )
  }
}
ORDER BY ?label

Our approach: Link Traversal Based Query Execution
[ISWC'09]

Main Idea
● Intertwine query evaluation with traversal of data links
● We alternate between:
● Evaluate parts of the query (triple patterns)
on a continuously augmented set of data
● Look up URIs in intermediate
solutions and add retrieved data
to the query-local dataset
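
The alternation between pattern evaluation and URI lookup described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the implementation from [ISWC'09]: the Web is simulated by a hypothetical in-memory dictionary `WEB`, the helper names (`look_up`, `match`, `execute`) are invented for this example, predicates are abbreviated to plain words, `http://example.org/AlicesPrj` and the project name are placeholder values, and triple patterns are evaluated in a fixed order.

```python
# Simulated Web (hypothetical data): URI -> triples returned by a lookup of that URI.
WEB = {
    "http://bob.name": [("http://bob.name", "knows", "http://alice.name")],
    "http://alice.name": [("http://alice.name", "project", "http://example.org/AlicesPrj")],
    "http://example.org/AlicesPrj": [("http://example.org/AlicesPrj", "name", "Alice's Project")],
}


def is_var(term):
    """Query variables are written with a leading question mark."""
    return term.startswith("?")


def look_up(term, dataset, seen):
    """Look up a URI once and add the retrieved triples to the query-local dataset."""
    if term.startswith("http://") and term not in seen:
        seen.add(term)
        dataset.update(WEB.get(term, []))


def match(pattern, solution, dataset):
    """Yield every extension of `solution` that makes `pattern` match a triple."""
    for triple in list(dataset):  # snapshot: the dataset may grow while we iterate
        candidate = dict(solution)
        for term, value in zip(pattern, triple):
            if is_var(term):
                if candidate.setdefault(term, value) != value:
                    break  # variable already bound to a different value
            elif term != value:
                break  # constant in the pattern does not match
        else:
            yield candidate


def execute(patterns):
    """Evaluate triple patterns one by one while traversing data links."""
    dataset, seen = set(), set()
    # Seed the dataset by looking up the URIs mentioned in the query itself.
    for pattern in patterns:
        for term in pattern:
            if not is_var(term):
                look_up(term, dataset, seen)
    solutions = [{}]
    for pattern in patterns:
        extended_solutions = []
        for solution in solutions:
            for extended in match(pattern, solution, dataset):
                # Look up URIs in the intermediate solution and add the retrieved data.
                for value in extended.values():
                    look_up(value, dataset, seen)
                extended_solutions.append(extended)
        solutions = extended_solutions
    return solutions


query = [
    ("http://bob.name", "knows", "?acq"),
    ("?acq", "project", "?prj"),
    ("?prj", "name", "?prjName"),
]
print(execute(query))
```

The sketch only finds data that becomes reachable from the URIs in the query, which mirrors why a query needs a URI as a starting point.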

Main Idea (example)
[Figure sequence: step-by-step execution of an example query over the query-local dataset]
● Query (three triple patterns):
http://bob.name knows ?acq .
?acq project ?prj .
?prj name ?prjName .
● Look up http://bob.name and add the retrieved data ("descriptor object") to the query-local dataset
● The triple ( http://bob.name , knows , http://alice.name ) matches the first pattern: ?acq = http://alice.name
● Look up http://alice.name and add the retrieved data to the query-local dataset
● The triple ( http://alice.name , project , http://.../AlicesPrj ) matches the second pattern: ?acq = http://alice.name , ?prj = http://.../AlicesPrj
● Evaluating the third pattern on the augmented dataset yields ?prj = http://.../AlicesPrj , ?prjName = “…“
● Solution: ?acq = http://alice.name , ?prj = http://.../AlicesPrj , ?prjName = “…“

Characteristics
● Link traversal based query execution:
● Evaluation on a continuously augmented dataset
● Discovery of potentially relevant data during execution
● Discovery driven by intermediate solutions

● Main advantage:
● No need to know all data sources in advance

● Limitations:
● Query has to contain a URI as a starting point
● Ignores data that is not reachable* by the query execution
* Formal definition in [LDOW'11a]

The Issue
[Figure sequence: execution of a second query with its own query-local dataset]
● Second query:
http://bob.name knows ?acq .
?acq interest ?i .
?i label ?iLabel .
● The execution starts with an empty query-local dataset, looks up http://bob.name again, and rediscovers ( http://bob.name , knows , http://alice.name )
● The first query (with ?prj and ?prjName) has already retrieved this data into its own query-local dataset

Reusing the Query-Local Dataset
[Figure sequence: the second query is evaluated on the query-local dataset of the first query]
● The reused dataset already contains ( http://bob.name , knows , http://alice.name )
● The second query obtains ?acq = http://alice.name directly from the reused dataset

Hypothesis

Re-using the query-local dataset (a.k.a. data caching)
may benefit query performance and result completeness
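
One way to picture the performance side of this hypothesis is a lookup component that is shared by several query executions and counts how often a URI was already retrieved. The following is a minimal sketch, assuming a hypothetical class `CachingLookup` and a caller-supplied `fetch` function; neither name comes from the slides, and no replacement or invalidation is modeled.

```python
class CachingLookup:
    """Keeps retrieved data across query executions and counts cache hits."""

    def __init__(self, fetch):
        self.fetch = fetch   # assumed callable: URI -> retrieved triples
        self.cache = {}      # URI -> data retrieved for that URI
        self.requests = 0
        self.hits = 0

    def look_up(self, uri):
        """Return the data for `uri`, fetching it only on the first request."""
        self.requests += 1
        if uri in self.cache:
            self.hits += 1
        else:
            self.cache[uri] = self.fetch(uri)
        return self.cache[uri]

    def hit_rate(self):
        """Fraction of lookups that were answered from the cache."""
        return self.hits / self.requests if self.requests else 0.0


fetched = []


def fake_fetch(uri):
    # Stand-in for an HTTP lookup; records which URIs were actually fetched.
    fetched.append(uri)
    return [(uri, "knows", "http://alice.name")]


lookup = CachingLookup(fake_fetch)
for _query in range(2):  # two queries that touch the same two URIs
    for uri in ("http://bob.name", "http://alice.name"):
        lookup.look_up(uri)
print(fetched, lookup.hit_rate())
```

In this toy run the second query triggers no fetches at all, so half of all lookups are hits.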


Contributions
● Systematic analysis of the impact of data caching
● Theoretical foundation*
● Conceptual analysis*
● Empirical evaluation of the potential impact

* See [LDOW'11a]

● Out of scope: Caching strategies (replacement, invalidation)


Experiment – Scenario

● Information about the distributed social network of FOAF profiles
● 5 types of queries
● Experiment setup:
● 20 persons
● Sequential use
➔ 100 queries


Experiment – Single Query
● no reuse experiment:
● No data caching
● reuse per query experiment:
● Reuse of query-local dataset for 3 executions of each query
● Third execution measured
[Figure: two bar charts comparing "no reuse" and "reuse per query" for the queries ContactInfoDanBri (Query No. 61), UnsetPropsDanBri (Query No. 62), 2ndDegree1DanBri (Query No. 63), 2ndDegree2DanBri (Query No. 64), and IncomingDanBri (Query No. 65). Left: number of query results (axis 0 to 60). Right: query execution time in seconds (logarithmic axis, 0.01 to 100).]

Experiment – Complete Sequence
● reuse all queries experiment:
● Reuse of the query-local dataset for the complete sequence of all 100 queries
[Figure: two bar charts comparing "no reuse", "reuse per query", and "reuse all queries" for the queries ContactInfoDanBri (Query No. 61), UnsetPropsDanBri (Query No. 62), 2ndDegree1DanBri (Query No. 63), 2ndDegree2DanBri (Query No. 64), and IncomingDanBri (Query No. 65). Left: number of query results (axis 0 to 60). Right: query execution time in seconds (logarithmic axis, 0.01 to 100).]

Outlook

● Requirements of a data cache:
● Replacement mechanism
● Coherency mechanism


Cache Replacement
● Cache full → remove descriptor objects
● Replacement strategy
● Primary goal: maximize hit rate
● Recency-based
● Frequency-based
● Function-based
● Randomized
● Replacement process
● Watermarks: high and low
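
As an illustration of a recency-based strategy combined with watermarks, the sketch below evicts least recently used entries. It rests on assumptions the slides do not state: the class name `WatermarkLRUCache` is hypothetical, cache size is counted in entries, and the watermarks are interpreted in the common way (replacement starts once the size exceeds the high watermark and stops when it has dropped to the low watermark).

```python
from collections import OrderedDict


class WatermarkLRUCache:
    """Recency-based cache that evicts in batches between two watermarks."""

    def __init__(self, high, low):
        if not 0 <= low < high:
            raise ValueError("expected 0 <= low < high")
        self.high = high
        self.low = low
        self.items = OrderedDict()  # least recently used entry comes first

    def get(self, key):
        """Return the cached value (or None) and mark the entry as recently used."""
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key, value):
        """Store a value; start the replacement process above the high watermark."""
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.high:
            while len(self.items) > self.low:
                self.items.popitem(last=False)  # drop the least recently used entry


cache = WatermarkLRUCache(high=4, low=2)
for uri in ("a", "b", "c", "d"):
    cache.put(uri, "data for " + uri)
cache.get("a")                # "a" becomes the most recently used entry
cache.put("e", "data for e")  # size 5 exceeds the high watermark
print(list(cache.items))
```

Evicting down to the low watermark in one batch avoids running the replacement process on every single insertion once the cache is full.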


Studying Cache Replacement?



“Web cache replacement in its general
form seems to be a solved topic.”
S. Podlipnig and L. Böszörmenyi: Survey of
Web Cache Replacement Strategies, 2003



● 6 quad indexes* in main memory
● Size grows linearly in the number of quads
● Example (after the reuse all queries experiment, 100 queries):
● 905 descriptor objects with 745,756 triples overall
● ca. 103 MB
➔ Available main memory is hardly a limiting factor
* As introduced in [LDOW'11b]

Cache Coherency
● Data items in the cache may become inconsistent
● Strong cache consistency
● Server validation
● Client validation
● Weak cache consistency
● Time to live (TTL)
● Adaptive TTL


Client Validation
● Polling every time
● Enables strong cache consistency
● Conditional GET
● Request with If-Modified-Since header
● Possible response: 304 Not Modified
● Not supported by most Linked Data servers
● Experiment based on the CKAN catalog of linked datasets
● Supported by only 41 out of 154 example resources (26.6%) from 110 datasets
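
A conditional GET can be issued with Python's standard library roughly as follows. This is a sketch, not code from the presented system: the function names `build_conditional_get` and `revalidate` are hypothetical, `cached_at` is assumed to be a timezone-aware datetime, and the example URL is a placeholder. Only `revalidate` touches the network.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def build_conditional_get(url, cached_at):
    """Build a GET request that asks for the resource only if it changed after `cached_at`."""
    stamp = format_datetime(cached_at.astimezone(timezone.utc), usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def revalidate(url, cached_body, cached_at):
    """Return (body, was_modified); the cached body is reused on 304 Not Modified."""
    request = build_conditional_get(url, cached_at)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), True
    except urllib.error.HTTPError as error:
        if error.code == 304:  # urllib reports 304 as an HTTPError
            return cached_body, False
        raise


request = build_conditional_get(
    "http://example.org/resource",
    datetime(1994, 11, 6, 8, 49, 37, tzinfo=timezone.utc),
)
print(request.get_header("If-modified-since"))
```

If a server ignores the header, it simply answers with the full representation, so `revalidate` degrades to an ordinary GET.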


Time to Live (TTL)
● TTL field: lifetime estimate for each object
● Supported by HTTP response headers (share of tested example resources that send them):
● Expires: 37.0%
● Cache-Control: max-age: 37.7%
● When the TTL elapses, the object is invalid
● Accessing an invalid object → re-retrieve the object
● Conditional GET (supported by 26.6%)
● Alternative (due to lack of support in Linked Data servers):
● Assume a default TTL for each object
● Ordinary GET
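
The TTL logic above, including the fallback to a default TTL, might look like this in Python. The sketch makes several assumptions: the function names are hypothetical, `headers` is a plain dictionary with the exact header names as keys, `retrieved_at` is a timezone-aware datetime, and the default of one day is an arbitrary illustrative value (the slides do not name one). Giving max-age precedence over Expires follows the usual HTTP caching rules.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DEFAULT_TTL = timedelta(days=1)  # assumed value, for illustration only


def ttl_from_headers(headers, retrieved_at):
    """Estimate a TTL from Cache-Control: max-age or Expires, else use DEFAULT_TTL."""
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return timedelta(seconds=int(value))
    expires = headers.get("Expires")
    if expires:
        try:
            remaining = parsedate_to_datetime(expires) - retrieved_at
            return max(remaining, timedelta(0))
        except (TypeError, ValueError):
            pass  # unparsable or timezone-less date: fall through to the default
    return DEFAULT_TTL


def is_valid(retrieved_at, ttl, now):
    """An object is valid until its TTL has elapsed."""
    return now < retrieved_at + ttl


retrieved = datetime(2011, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(ttl_from_headers({"Cache-Control": "public, max-age=600"}, retrieved))
print(ttl_from_headers({"Expires": "Sat, 01 Jan 2011 13:00:00 GMT"}, retrieved))
print(ttl_from_headers({}, retrieved))
```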


Adaptive TTL
● Assumption:
● The older an object, the less likely it is to be modified
● TTL is a percentage of the age:
● Threshold = 10%; age = 30 days → TTL = 3 days
● Last verification: yesterday → invalidation in 2 days
● HTTP-based implementation (35.1%):
● Calculation of age: use Last-Modified response header
● Verification with conditional GET
● Alternative (due to lack of support in Linked Data servers):
● Assume Last-Modified is the time of first retrieval
● Verification by comparing a response to the current version
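
The arithmetic of the adaptive TTL example can be checked with a few lines of Python. This is a sketch with hypothetical function names; it assumes that the age is measured at the time of the last verification and that the TTL starts counting from that verification, which is one reading of the example on the slide.

```python
from datetime import datetime, timedelta


def adaptive_ttl(last_modified, last_verified, threshold=0.10):
    """TTL as a fraction (`threshold`) of the object's age at the last verification."""
    age = last_verified - last_modified
    return age * threshold


def invalidation_time(last_modified, last_verified, threshold=0.10):
    """Point in time at which the cached object has to be verified again."""
    return last_verified + adaptive_ttl(last_modified, last_verified, threshold)


now = datetime(2011, 3, 1, 12, 0, 0)
last_verified = now - timedelta(days=1)             # verified yesterday
last_modified = last_verified - timedelta(days=30)  # object was 30 days old then

print(adaptive_ttl(last_modified, last_verified))             # TTL of 3 days
print(invalidation_time(last_modified, last_verified) - now)  # 2 days remain
```

When the server sends no Last-Modified header, `last_modified` would be replaced by the time of first retrieval, as in the alternative listed above.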

Summary
● Systematic analysis of the impact of data caching
● Theoretical foundation
● Conceptual analysis
● Empirical evaluation
● Main findings:
● Additional results possible (for semantically similar queries)
● Impact on performance may be positive but also negative
● Future work:
● Analysis of caching strategies in our context
● Main issue: invalidation


Backup Slides


Contributions
● Theoretical foundation (extension of the original definition)
● Reachability by a D_seed-initialized execution of a BGP query b
● D_seed-dependent solution for a BGP query b
● Reachability R(B) for a serial execution of B = b_1 , … , b_n
➔ Each solution for b_cur is also an R(B)-dependent solution for b_cur
● Conceptual analysis of the impact of data caching
● Performance factor: p( b_cur , B ) = c( b_cur , [ ] ) – c( b_cur , B )
● Serendipity factor: s( b_cur , B ) = b( b_cur , B ) – b( b_cur , [ ] )
● Empirical verification of the potential impact

● Out of scope: Caching strategies (replacement, invalidation)

Query Template Contact
SELECT * WHERE {
  <PERSON> foaf:knows ?p .

  OPTIONAL { ?p foaf:name ?name }
  OPTIONAL { ?p foaf:firstName ?firstName }
  OPTIONAL { ?p foaf:givenName ?givenName }
  OPTIONAL { ?p foaf:givenname ?givenname }
  OPTIONAL { ?p foaf:familyName ?familyName }
  OPTIONAL { ?p foaf:family_name ?family_name }
  OPTIONAL { ?p foaf:lastName ?lastName }
  OPTIONAL { ?p foaf:surname ?surname }

  OPTIONAL { ?p foaf:birthday ?birthday }

  OPTIONAL { ?p foaf:img ?img }

  OPTIONAL { ?p foaf:phone ?phone }
  OPTIONAL { ?p foaf:aimChatID ?aimChatID }
  OPTIONAL { ?p foaf:icqChatID ?icqChatID }
  OPTIONAL { ?p foaf:jabberID ?jabberID }
  OPTIONAL { ?p foaf:msnChatID ?msnChatID }
  OPTIONAL { ?p foaf:skypeID ?skypeID }
  OPTIONAL { ?p foaf:yahooChatID ?yahooChatID }
}


Query Template UnsetProps
SELECT DISTINCT ?result ?resultLabel WHERE
{
  ?result rdfs:isDefinedBy <http://xmlns.com/foaf/0.1/> .
  ?result rdfs:domain foaf:Person .

  OPTIONAL { <PERSON> ?result ?var0 }
  FILTER ( !bound(?var0) )

  <PERSON> foaf:knows ?var2 .
  ?var2 ?result ?var3 .
  ?result rdfs:label ?resultLabel .
  ?result vs:term_status ?var1 .
}
ORDER BY ?var1


Query Template Incoming
SELECT DISTINCT ?result WHERE
{
  ?result foaf:knows <PERSON> .

  OPTIONAL
  {
    ?result foaf:knows ?var1 .
    FILTER ( <PERSON> = ?var1 )
    <PERSON> foaf:knows ?result .
  }
  FILTER ( !bound(?var1) )
}


Query Template 2ndDegree1
{
  <PERSON> foaf:knows ?p1 .
  FILTER ( ?p1 != ?p2 )

  ?p1 foaf:knows ?result .
  FILTER ( <PERSON> != ?result )
  ?p2 foaf:knows ?result .

  OPTIONAL {
    <PERSON> ?knows ?result .
    FILTER ( ?knows = foaf:knows )
  }
  FILTER ( !bound(?knows) )
}

Query Template 2ndDegree2
{
  FILTER ( ?p1 != ?p2 )

  ?result foaf:knows ?p1 .
  FILTER ( <PERSON> != ?result )
  ?result foaf:knows ?p2 .

  OPTIONAL {
    <PERSON> ?knows ?result .
    FILTER ( ?knows = foaf:knows )
  }
  FILTER ( !bound(?knows) )
}

Experiment           Avg.¹ number of      Average¹          Avg.¹ query
                     query results        hit rate          execution time
                     (std.dev.)           (std.dev.)        (std.dev.)
no reuse             27.37 (140.49)       0.849 (0.205)     64.95 s (124.50)
reuse per query      26.71 (148.77)       1 (0)             0.02 s (0.07)
reuse all queries    44.87 (178.36)       0.991 (0.053)     37.91 s (112.94)
reuse all queries    118.18 (867.07)      0.992 (0.016)     20.61 s (216.61)
(random orders)

¹ Averaged over all 100 queries

● In the ideal case for B_upper = [ b_cur , b_cur ]:
● p_upper( b_cur , B_upper ) = c( b_cur , [ ] ) – c( b_cur , B_upper ) = c( b_cur , [ ] )
● s_upper( b_cur , B_upper ) = b( b_cur , B_upper ) – b( b_cur , [ ] ) = 0
● Summary for reuse per query (measurement errors aside):
● Same number of query results
● Significant improvements in query performance
● Summary for reuse all queries:
● Data cache may provide for additional query results
● Impact on performance may be positive but also negative
● Executing the query sequence in a random order results in measurements similar to the given order.

These slides have been created by
Olaf Hartig

http://olafhartig.de

This work is licensed under a
Creative Commons Attribution-Share Alike 3.0 License
(http://creativecommons.org/licenses/by-sa/3.0/)

