Querying datasets on the Web with high availability

Querying datasets on the Web
with high availability
Ruben Verborgh, Olaf Hartig, Ben De Meester, Gerald Haesendonck,
Laurens De Vocht, Miel Vander Sande, Richard Cyganiak, Pieter Colpaert,
Erik Mannens, and Rik Van de Walle

Somebody comes to you
with a high-quality
Linked Open Data set:
“How can people query
my Linked Data on the Web?”

“A public SPARQL endpoint
gives live querying, but it’s costly
and has availability issues.”
“Offer a data dump.
but it’s not really Web querying:
users need to set up an endpoint”
“Publish Linked Data documents.
But querying is very slow…”

Querying Linked Data
on the Web
always involves trade-offs.
But have we looked
at all possible trade-offs?

Querying Linked Data
live on the Web
becomes affordable
by building simpler servers
and more intelligent clients.

We formalized and evaluated
client-side SPARQL querying
through a triple-pattern interface.
The trade-offs are different:
low-cost, high-availability servers,
more bandwidth and slower queries.

Trade-offs for the Semantic Web
Triple pattern fragment querying
Evaluating the trade-offs

It is not an all-or-nothing world.
There is a spectrum of trade-offs.
out-of-date data live data
high bandwidth low bandwidth
high availability low availability
high client cost low client cost
low server cost high server cost
data
dump
SPARQL
endpoint
Linked Data
documents
interface offered by the server

Linked Data Fragments are
a uniform view on Linked Data interfaces.
data
dump
SPARQL
endpoint
Every Linked Data interface
offers specific fragments
of a Linked Data set.
Linked Data
documents
interface offered by the server

Each type of Linked Data Fragment
is defined by three characteristics.
data
metadata
controls
What triples does it contain?
What do we know about it?
How to access more data?

all dataset triples
(none)
data dump
number of triples, file size
data
metadata
controls

SPARQL query result
data
triples matching the query
(none)
(none)
metadata
controls

Linked Data Document
data
triples about a subject
creator, maintainer, …
links to other LD documents
metadata
controls

We designed a new trade-off mix
with low cost and high availability.
out-of-date data live data
high bandwidth low bandwidth
high availability low availability
high client cost low client cost
low server cost high server cost
data
dump
SPARQL
query results
Linked Data
documents

A triple pattern fragments interface
is low-cost and enables clients to query.
live data
low server cost
data
dump
SPARQL
query results
high availability
Linked Data
documents
triple pattern
fragments

A triple pattern fragments interface
is low-cost and enables clients to query.
matches of a triple pattern
total number of matches
access to all other fragments
data
metadata
controls
(paged)

controls (other fragments)
metadata (total count)
data (first 100)

Triple patterns are not the final answer.
No interface ever will be.
Triple patterns show how far we can get
with simple servers and smart clients.
data
dump
SPARQL
query results
Linked Data
documents
triple pattern
fragments

Triple pattern fragment servers
enable clients to be intelligent.
The HTML representation explains:
“you can query by triple pattern”.
controls

<http://fragments.dbpedia.org/2014/en#dataset> hydra:search [
hydra:template "http://fragments.dbpedia.org/2014/en
controls
{?subject,predicate,object}";
hydra:mapping
[ hydra:variable "subject"; hydra:property rdf:subject ],
[ hydra:variable "predicate"; hydra:property rdf:predicate ],
[ hydra:variable "object"; hydra:property rdf:object ]
].
The RDF representation explains:
“you can query by triple pattern”.

The HTML representation explains:
“this is the number of matches”.
metadata

<#fragment> void:triples 8141.
The RDF representation explains:
“this is the number of matches”.
metadata

How can intelligent clients
solve SPARQL queries over fragments?
Give them a SPARQL query.
Give them a URL of any dataset fragment.
They look inside the fragment
to see how to access the dataset
and use the metadata
to decide how to plan the query.

Let’s follow the execution
of an example SPARQL query.
Find names of artists born in Padua, Italy.
SELECT ?artist ?name WHERE {
?artist a dbpedia-owl:Artist;
rdfs:label ?name;
dbpedia-owl:birthPlace dbpedia:Padua.
FILTER LANGMATCHES(LANG(?name), "EN")
}
Fragment: http://fragments.dbpedia.org/2014/en

The client looks inside the fragment
to see how to access the dataset.
Fragment: http://fragments.dbpedia.org/2014/en
<http://fragments.dbpedia.org/2014/en#dataset> hydra:search [
hydra:template "http://fragments.dbpedia.org/2014/en
{?subject,predicate,object}";
hydra:mapping
[ hydra:variable "subject"; hydra:property rdf:subject ],
[ hydra:variable "predicate"; hydra:property rdf:predicate ],
[ hydra:variable "object"; hydra:property rdf:object ]
].
“I can query the dataset by triple pattern.”

The client splits the query
into the available fragments.
rdfs:label ?name;
}

The client gets the fragments
and inspects their metadata.
?artist a dbpedia-owl:Artist.
first 100 triples
96.000
?artist rdfs:label ?name.
first 100 triples
12.000.000
?artist dbont:birthPlace dbpedia:Padua.
first 100 triples
135

The metadata enables the client
to choose the right starting point.
?dbp:artist Alberto_a dbpedia-Benettin owl:a Artist. dbont:Artist.
96.000
?artist rdfs:label ?name. 12.000.000
dbp:Alberto_Benettin rdfs:label ?name.
?artist dbont:birthPlace dbpedia:Padua.
135
dbpedia:Alberto_Benettin dbont:birthPlace dbpedia:Padua.
dbpedia:Alberto_Bigon dbont:birthPlace dbpedia:Padua.

Clients execute the query in 3 seconds
on a highly available, low-cost server.
rdfs:label ?name;
}
Try it yourself:
bit.ly/artistspadua

We evaluated triple pattern fragments
for server cost and availability.
Berlin SPARQL benchmark
on Amazon EC2 machines
100 million triples
high query diversity
BGP, UNION, FILTER, …

We evaluated triple pattern fragments
for server cost and availability.
Berlin SPARQL benchmark
on Amazon EC2 machines
1 server
1 cache
1–240 simultaneous clients

The query throughput is lower,
but resilient to high client numbers.
Querying Datasets on 1 10 100 10 100 1000 10000 clients
throughput (q/hr)
Virtuoso 6 Fuseki–tdb triple pattern Fig. 3.1: Server performance (log-log plot)
200
executed SPARQL queries per hour
timeouts

The server traffic is higher,
but requests are significantly lighter.
Datasets on the Web with High Availability 13
Virtuoso 6 Virtuoso 7
Fuseki–tdb Fuseki–hdt
pattern fragments
1 10 100
4
2
0
clients
data sent (mb)
Fig. 3.2: Server network traffic
data sent by server in MB

2
Caching is significantly more effective,
as clients reuse fragments for queries.
1 10 100
0
clients
sent (mb)
Fig. 3.2: Server network traffic
1 10 100
20
10
0
clients
sent (mb)
Fig. 3.4: Cache network traffic
ram use data sent by cache in MB
8
6
4

150
The server uses much less CPU,
allowing for higher availability.
clients
1 10 100
1 1 10 100
100
50
0
100
50
1 50
server CPU usage per core
#timeouts
Fig. 3.3: Query timeouts
0
clients
cpu use (%)
Fig. 3.5: Server processor usage per core
100
use (%)

150
Servers enable clients to be intelligent,
so they remain simple and light-weight.
1 10 100
1 1 10 100
100
50
0
clients
#timeouts
Fig. 3.3: Query timeouts
100
50
0
clients
cpu use (%)
1 Fig. 3.5: Server processor usage per core
50
100
use (%)
server CPU usage per core

Triple pattern fragments allow
querying live data with high availability.
live data
low server cost
data
dump
SPARQL
query results
high availability
Linked Data
documents
triple pattern
fragments

Like each Linked Data query method,
triple pattern fragments have trade-offs.
Live data
Low-cost and highly available servers
More bandwidth
Slower results (but they are streaming)

Experience the trade-offs yourself
on the official DBpedia interfaces.
DBpedia data dump
DBpedia Linked Data documents
DBpedia SPARQL endpoint
DBpedia triple pattern fragments
fragments.dbpedia.org

Triple pattern fragments are easy:
all software is available as open source.
Software
github.com/LinkedDataFragments
Documentation and specification
linkeddatafragments.org

Querying is a client–server dialogue.
Start exploring the entire spectrum.
data
dump
SPARQL
query results
Linked Data
documents
triple pattern
fragments

linkeddatafragments.org
@LDFragments

Querying datasets on the Web with high availability

More Related Content

What's hot

Similar to Querying datasets on the Web with high availability

Recently uploaded

Querying datasets on the Web with high availability