Querying datasets on the Web with high availability
1. Querying datasets on the Web
with high availability
Ruben Verborgh, Olaf Hartig, Ben De Meester, Gerald Haesendonck,
Laurens De Vocht, Miel Vander Sande, Richard Cyganiak, Pieter Colpaert,
Erik Mannens, and Rik Van de Walle
2. Somebody comes to you
with a high-quality
Linked Open Data set:
“How can people query
my Linked Data on the Web?”
3. “A public SPARQL endpoint
gives live querying, but it’s costly
and has availability issues.”
“Offer a data dump.
but it’s not really Web querying:
users need to set up an endpoint”
“Publish Linked Data documents.
But querying is very slow…”
4. Querying Linked Data
on the Web
always involves trade-offs.
But have we looked
at all possible trade-offs?
5. Querying Linked Data
live on the Web
becomes affordable
by building simpler servers
and more intelligent clients.
6. We formalized and evaluated
client-side SPARQL querying
through a triple-pattern interface.
The trade-offs are different:
low-cost, high-availability servers,
more bandwidth and slower queries.
7. Querying datasets on the Web
with high availability
Trade-offs for the Semantic Web
Triple pattern fragment querying
Evaluating the trade-offs
8. Querying datasets on the Web
with high availability
Trade-offs for the Semantic Web
Triple pattern fragment querying
Evaluating the trade-offs
9. It is not an all-or-nothing world.
There is a spectrum of trade-offs.
out-of-date data live data
high bandwidth low bandwidth
high availability low availability
high client cost low client cost
low server cost high server cost
data
dump
SPARQL
endpoint
Linked Data
documents
interface offered by the server
10. Linked Data Fragments are
a uniform view on Linked Data interfaces.
data
dump
SPARQL
endpoint
Every Linked Data interface
offers specific fragments
of a Linked Data set.
Linked Data
documents
interface offered by the server
11. Each type of Linked Data Fragment
is defined by three characteristics.
data
metadata
controls
What triples does it contain?
What do we know about it?
How to access more data?
12. Each type of Linked Data Fragment
is defined by three characteristics.
all dataset triples
(none)
data dump
number of triples, file size
data
metadata
controls
13. Each type of Linked Data Fragment
is defined by three characteristics.
SPARQL query result
data
triples matching the query
(none)
(none)
metadata
controls
14. Each type of Linked Data Fragment
is defined by three characteristics.
Linked Data Document
data
triples about a subject
creator, maintainer, …
links to other LD documents
metadata
controls
15. We designed a new trade-off mix
with low cost and high availability.
out-of-date data live data
high bandwidth low bandwidth
high availability low availability
high client cost low client cost
low server cost high server cost
data
dump
SPARQL
query results
Linked Data
documents
16. A triple pattern fragments interface
is low-cost and enables clients to query.
live data
low server cost
data
dump
SPARQL
query results
high availability
Linked Data
documents
triple pattern
fragments
17. A triple pattern fragments interface
is low-cost and enables clients to query.
matches of a triple pattern
total number of matches
access to all other fragments
data
metadata
controls
(paged)
19. Triple patterns are not the final answer.
No interface ever will be.
Triple patterns show how far we can get
with simple servers and smart clients.
data
dump
SPARQL
query results
Linked Data
documents
triple pattern
fragments
20. Querying datasets on the Web
with high availability
Trade-offs for the Semantic Web
Triple pattern fragment querying
Evaluating the trade-offs
21. Triple pattern fragment servers
enable clients to be intelligent.
The HTML representation explains:
“you can query by triple pattern”.
controls
22. Triple pattern fragment servers
enable clients to be intelligent.
<http://fragments.dbpedia.org/2014/en#dataset> hydra:search [
hydra:template "http://fragments.dbpedia.org/2014/en
controls
{?subject,predicate,object}";
hydra:mapping
[ hydra:variable "subject"; hydra:property rdf:subject ],
[ hydra:variable "predicate"; hydra:property rdf:predicate ],
[ hydra:variable "object"; hydra:property rdf:object ]
].
The RDF representation explains:
“you can query by triple pattern”.
23. Triple pattern fragment servers
enable clients to be intelligent.
The HTML representation explains:
“this is the number of matches”.
metadata
24. Triple pattern fragment servers
enable clients to be intelligent.
<#fragment> void:triples 8141.
The RDF representation explains:
“this is the number of matches”.
metadata
25. How can intelligent clients
solve SPARQL queries over fragments?
Give them a SPARQL query.
Give them a URL of any dataset fragment.
They look inside the fragment
to see how to access the dataset
and use the metadata
to decide how to plan the query.
26. Let’s follow the execution
of an example SPARQL query.
Find names of artists born in Padua, Italy.
SELECT ?artist ?name WHERE {
?artist a dbpedia-owl:Artist;
rdfs:label ?name;
dbpedia-owl:birthPlace dbpedia:Padua.
FILTER LANGMATCHES(LANG(?name), "EN")
}
Fragment: http://fragments.dbpedia.org/2014/en
27. The client looks inside the fragment
to see how to access the dataset.
Fragment: http://fragments.dbpedia.org/2014/en
<http://fragments.dbpedia.org/2014/en#dataset> hydra:search [
hydra:template "http://fragments.dbpedia.org/2014/en
{?subject,predicate,object}";
hydra:mapping
[ hydra:variable "subject"; hydra:property rdf:subject ],
[ hydra:variable "predicate"; hydra:property rdf:predicate ],
[ hydra:variable "object"; hydra:property rdf:object ]
].
“I can query the dataset by triple pattern.”
28. The client splits the query
into the available fragments.
SELECT ?artist ?name WHERE {
?artist a dbpedia-owl:Artist;
rdfs:label ?name;
dbpedia-owl:birthPlace dbpedia:Padua.
FILTER LANGMATCHES(LANG(?name), "EN")
}
29. The client gets the fragments
and inspects their metadata.
?artist a dbpedia-owl:Artist.
first 100 triples
96.000
?artist rdfs:label ?name.
first 100 triples
12.000.000
?artist dbont:birthPlace dbpedia:Padua.
first 100 triples
135
30. The metadata enables the client
to choose the right starting point.
?dbp:artist Alberto_a dbpedia-Benettin owl:a Artist. dbont:Artist.
96.000
?artist rdfs:label ?name. 12.000.000
dbp:Alberto_Benettin rdfs:label ?name.
?artist dbont:birthPlace dbpedia:Padua.
135
dbpedia:Alberto_Benettin dbont:birthPlace dbpedia:Padua.
dbpedia:Alberto_Bigon dbont:birthPlace dbpedia:Padua.
31. Clients execute the query in 3 seconds
on a highly available, low-cost server.
SELECT ?artist ?name WHERE {
?artist a dbpedia-owl:Artist;
rdfs:label ?name;
dbpedia-owl:birthPlace dbpedia:Padua.
FILTER LANGMATCHES(LANG(?name), "EN")
}
Try it yourself:
bit.ly/artistspadua
32. Querying datasets on the Web
with high availability
Trade-offs for the Semantic Web
Triple pattern fragment querying
Evaluating the trade-offs
33. We evaluated triple pattern fragments
for server cost and availability.
Berlin SPARQL benchmark
on Amazon EC2 machines
100 million triples
high query diversity
BGP, UNION, FILTER, …
34. We evaluated triple pattern fragments
for server cost and availability.
Berlin SPARQL benchmark
on Amazon EC2 machines
1 server
1 cache
1–240 simultaneous clients
35. The query throughput is lower,
but resilient to high client numbers.
Querying Datasets on 1 10 100 10 100 1000 10000 clients
throughput (q/hr)
Virtuoso 6 Fuseki–tdb triple pattern Fig. 3.1: Server performance (log-log plot)
200
executed SPARQL queries per hour
timeouts
36. The server traffic is higher,
but requests are significantly lighter.
Datasets on the Web with High Availability 13
Virtuoso 6 Virtuoso 7
Fuseki–tdb Fuseki–hdt
pattern fragments
1 10 100
4
2
0
clients
data sent (mb)
Fig. 3.2: Server network traffic
data sent by server in MB
37. 2
Caching is significantly more effective,
as clients reuse fragments for queries.
1 10 100
0
clients
sent (mb)
Fig. 3.2: Server network traffic
1 10 100
20
10
0
clients
sent (mb)
Fig. 3.4: Cache network traffic
ram use data sent by cache in MB
8
6
4
38. 150
The server uses much less CPU,
allowing for higher availability.
clients
1 10 100
1 1 10 100
100
50
0
100
50
1 50
server CPU usage per core
#timeouts
Fig. 3.3: Query timeouts
0
clients
cpu use (%)
Fig. 3.5: Server processor usage per core
100
use (%)
39. 150
Servers enable clients to be intelligent,
so they remain simple and light-weight.
1 10 100
1 1 10 100
100
50
0
clients
#timeouts
Fig. 3.3: Query timeouts
100
50
0
clients
cpu use (%)
1 Fig. 3.5: Server processor usage per core
50
100
use (%)
server CPU usage per core
40. Querying datasets on the Web
with high availability
Trade-offs for the Semantic Web
Triple pattern fragment querying
Evaluating the trade-offs
41. Triple pattern fragments allow
querying live data with high availability.
live data
low server cost
data
dump
SPARQL
query results
high availability
Linked Data
documents
triple pattern
fragments
42. Like each Linked Data query method,
triple pattern fragments have trade-offs.
Live data
Low-cost and highly available servers
More bandwidth
Slower results (but they are streaming)
43. Experience the trade-offs yourself
on the official DBpedia interfaces.
DBpedia data dump
DBpedia Linked Data documents
DBpedia SPARQL endpoint
DBpedia triple pattern fragments
fragments.dbpedia.org
44. Triple pattern fragments are easy:
all software is available as open source.
Software
github.com/LinkedDataFragments
Documentation and specification
linkeddatafragments.org
45. Querying is a client–server dialogue.
Start exploring the entire spectrum.
data
dump
SPARQL
query results
Linked Data
documents
triple pattern
fragments
46. Querying datasets on the Web
with high availability
linkeddatafragments.org
@LDFragments