7. Children are sad because they didn’t
get the information they needed.
Sinterklaas got a burnout.
9. Linked Data archives also have
this sustainability problem.
Many data-publishing institutions are
under-resourced.
Many of them care about data history.
They look for "good-enough" solutions,
commonly resort to data dumps,
and cannot afford SPARQL infrastructure.
10. Linked Data archives also have
this sustainability problem.
Many clients asking complex queries
is very expensive for a server to handle at scale.
Access to data history makes this
problem even harder.
Unavailable servers prevent applications
from unlocking their potential.
11. Pragmatic archiving with HDT
Sustainable querying with
Triple Pattern Fragments
Uniform access to history with Memento
A sweet affordable combo for
Linked Data Archives
Time travelling through DBpedia
13. Single archive file (*.hdt)
Header-Dictionary-Triples (HDT) is a
compact binary RDF representation.
[Diagram: Header, Dictionary, and Triples components]
Created by Javier Fernández et al. (he should be in this room…)
14. Features of HDT are desired properties
for digital archives.
High volumes: represent massive data sets as a single file.
Direct access: rapid search for ?subject ?predicate ?object.
Discovery and exchange: included header with dataset metadata.
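A minimal sketch of what direct access looks like in practice, assuming the pyHDT bindings (pip package hdt); the file name is a placeholder:

    from hdt import HDTDocument

    # Open a single-file HDT archive (placeholder file name).
    document = HDTDocument("dbpedia-2015-10.hdt")

    # Search for a triple pattern; empty strings act as wildcards.
    triples, cardinality = document.search_triples(
        "http://dbpedia.org/resource/Amsterdam", "", "")

    print("estimated matches:", cardinality)
    for subject, predicate, obj in triples:
        print(subject, predicate, obj)

The cardinality estimate comes straight from the HDT index, which is what makes counts and paging cheap for a server.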
15. A matrix of HDT files can serve as a
pragmatic RDF archive.
[Matrix diagram: one HDT file per dataset (Dataset A through Dataset Z) per version (t0 back to t-4), with a time-based index over the versions]
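A minimal sketch of such a time-based index (the dataset names, file names, and timestamps below are hypothetical): given a dataset and a target datetime, return the newest snapshot that is not newer than the target.

    import bisect
    from datetime import datetime, timezone

    # Hypothetical index: per dataset, version timestamps sorted ascending.
    INDEX = {
        "dbpedia": [
            (datetime(2014, 9, 1, tzinfo=timezone.utc), "dbpedia-2014.hdt"),
            (datetime(2015, 4, 1, tzinfo=timezone.utc), "dbpedia-2015-04.hdt"),
            (datetime(2015, 10, 1, tzinfo=timezone.utc), "dbpedia-2015-10.hdt"),
        ],
    }

    def snapshot_for(dataset: str, target: datetime) -> str:
        """Return the HDT file of the newest version at or before target."""
        versions = INDEX[dataset]
        position = bisect.bisect_right([stamp for stamp, _ in versions], target)
        if position == 0:
            raise LookupError("no snapshot exists at or before that time")
        return versions[position - 1][1]

    print(snapshot_for("dbpedia", datetime(2015, 6, 15, tzinfo=timezone.utc)))
    # -> dbpedia-2015-04.hdt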
16. 14 DBpedia versions take 12.75%
of the original N-Triples size.
[Bar chart, 0-160 GB: original N-Triples size vs. HDT size for each DBpedia version from 2.0 through 2015-10]
17. Space and time-to-publish significantly
decreased for DBpedia.

                    Original                   HDT-based
    Indexing        Custom                     HDT-CPP
    Indexing time   ~24 hours per version      ~4 hours per version
    Storage         MongoDB                    HDT binary files
    Space           383 GB                     178 GB
    # Versions      10 (2.0 through 3.9)       14 (2.0 through 2015-10)
    # Triples       ~3 billion                 ~6 billion
18. Pragmatic archiving with HDT
Sustainable querying with
Triple Pattern Fragments
Uniform access to history with Memento
A sweet affordable combo for
Linked Data Archives
Time travelling through DBpedia
19. Linked Data Fragments: hunting
trade-offs between client & server.
[Axis of interfaces offered by the server, from data dump via Linked Data pages to SPARQL endpoint:
  data dump: low server cost, high client cost, high availability, high bandwidth, out-of-date data
  SPARQL endpoint: high server cost, low client cost, low availability, low bandwidth, live data]
20. A triple pattern fragments interface
is low-cost and enables clients to query.
[Same axis: triple pattern fragments sit between data dumps / Linked Data pages and SPARQL query results, combining low server cost, high availability, and live data]
21. A Triple Pattern Fragments interface
acts as a gateway to an RDF source.
Clients can only ask for ?s ?p ?o patterns.
Complex SPARQL queries are decomposed
on the client side.
Low server cost, highly cacheable,
but higher bandwidth and query times.
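As an illustration, one page of such a fragment can be fetched with plain HTTP; a real client discovers the query form from the interface's hypermedia controls, so the parameter names below are an assumption modeled on the public DBpedia endpoint:

    import requests

    # Request one page of matches for a single triple pattern.
    response = requests.get(
        "http://fragments.dbpedia.org/2015-10/en",
        params={
            "subject": "http://dbpedia.org/resource/Amsterdam",
            "predicate": "",
            "object": "",
        },
        headers={"Accept": "text/turtle"},
    )
    print(response.status_code)
    print(response.text[:500])  # first triples plus count and paging metadata

Because every request is a simple GET for one pattern, responses cache extremely well, which is where the low server cost comes from.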
25. And still, the API has maintained
99.99% availability to date.
26. Pragmatic archiving with HDT
Sustainable querying with
Triple Pattern Fragments
Uniform access to history with Memento
A sweet affordable combo for
Linked Data Archives
Time travelling through DBpedia
28. Any client can transparently
navigate to a prior version.
30. For archives, interface granularity
and design are even more important.
[Diagram: data dumps and Linked Data pages leave the consumer cost high, with at best partial Memento support; a SPARQL endpoint has a high publisher cost, and Memento support for it is difficult]
31. The Triple Pattern Fragments trade-off
also pays off for archives.
[Same diagram, now with triple pattern fragments alongside data dumps, Linked Data pages, and SPARQL query results: they are directly compatible with Memento, useful for the consumer (queryable), and sustainable for the publisher]
32. Different HDT snapshots are exposed
through an LDF server with Memento
http://fragments.dbpedia.org
33. DBpedia pages can be made available
through a proxy.
http://dbpedia.org/resource/…
34. Preparing the TPF client is simply
adding an HTTP header.
[Client stack: Query Engine (SPARQL processing) over Hypermedia Layer (fragments interaction) over HTTP Layer (resource access).
The HTTP layer sends GET with Accept-Datetime; the server redirects with 303 Location, or answers 200 with Content-Location (for CORS), selecting the matching snapshot (Dataset A or Dataset B)]
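A minimal sketch of that negotiation with Python's requests; the TimeGate URL is illustrative, and the headers follow RFC 7089 (Memento):

    import requests

    # Ask for the resource as it existed at a given moment.
    response = requests.get(
        "http://fragments.dbpedia.org/en",
        headers={"Accept-Datetime": "Wed, 15 Apr 2015 00:00:00 GMT"},
        allow_redirects=True,  # follows a 303 Location to the memento
    )

    # A Memento-aware server reports which snapshot actually answered.
    print(response.url)
    print(response.headers.get("Memento-Datetime"))
    print(response.headers.get("Content-Location"))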
35. A self-descriptive interface results
in a single datetime negotiation.
[Same stack: only the first request carries Accept-Datetime; subsequent GETs receive 200 responses directly, because the fragment's hypermedia controls already point into the chosen snapshot]
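To illustrate the single negotiation (the versioned URL and its page parameter are assumptions modeled on the public endpoint): once the client holds a fragment from the chosen snapshot, it follows that fragment's version-specific links with plain GETs, no Accept-Datetime needed.

    import requests

    # Follow-up request to a link taken from the negotiated fragment.
    page2 = requests.get("http://fragments.dbpedia.org/2015-04/en?page=2")
    print(page2.status_code)  # 200, without any further negotiation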
36. Pragmatic archiving with HDT
Sustainable querying with
Triple Pattern Fragments
Uniform access to history with Memento
A sweet affordable combo for
Linked Data Archives
Time travelling through DBpedia
37. There is a huge amount of interesting
information in the history of
Linked Data.
What could we learn if we could
easily query it?
38. Querying history and the evolution
of facts.
When did a researcher named
Frederik H. Kreuger, born in
Amsterdam, die?
Try it yourself:
bit.ly/frederikkreuger
bit.ly/frederikkreuger-2013
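One possible shape of that query (the actual queries sit behind the bit.ly links; dbo:birthPlace and dbo:deathDate are standard DBpedia ontology terms, but the archived versions may use different vocabulary), written as a string a TPF client would decompose into triple pattern requests:

    # Sketch of the history query; the vocabulary is an assumption.
    QUERY = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>

    SELECT ?deathDate WHERE {
      ?person foaf:name "Frederik H. Kreuger"@en ;
              dbo:birthPlace dbr:Amsterdam ;
              dbo:deathDate ?deathDate .
    }
    """

Running it against the current endpoint and against a 2013 memento shows when the fact appeared.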
39. What predicates were added in DBpedia
between 2009 and 2014 to describe
a person?
Analyze and profile changes
in a dataset.
Try it yourself:
bit.ly/personpredicates-2009
bit.ly/personpredicates-2014
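A sketch of the underlying query (again an assumption about its shape): list the distinct predicates describing persons, run it once against the 2009 memento and once against the 2014 memento, and diff the two result sets.

    # Predicates used to describe persons; diff the results across snapshots.
    QUERY = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    SELECT DISTINCT ?predicate WHERE {
      ?person rdf:type dbo:Person ;
              ?predicate ?object .
    }
    """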
40. What works by cubists were known to
DBpedia and VIAF in 2009?
Resolve out-of-sync issues between
federated sources.
Try it yourself:
bit.ly/workscubists-2009
bit.ly/workscubists
41. Start hosting your own Linked Data
archive (or play with the DBpedia one)!
Software:
github.com/LinkedDataFragments
bit.ly/configuring-memento
Documentation and specification:
www.rdfhdt.org
linkeddatafragments.org
mementoweb.org
Query the DBpedia archive on
fragments.mementodepot.org
42. A sweet affordable combo for
Linked Data Archives
@Miel_vds
Herbert Van de Sompel
Harihar Shankar
Lyudmila Balakireva
Ruben Verborgh