7. Children are sad because they didn’t
get the information they needed.
Sinterklaas got a burnout.
9. Linked Data archives also have
this sustainability problem.
Many data-publishing institutions are
under-resourced.
Many of them care about data history.
They look for "good-enough" solutions,
commonly resort to data dumps,
and cannot afford SPARQL infrastructure.
10. Linked Data archives also have
this sustainability problem.
Many clients asking complex queries
is very expensive for a server to handle at scale.
Access to data history makes this
problem even harder.
Unavailable servers prevent applications
from unlocking their potential.
11. Pragmatic archiving with HDT
Sustainable querying with
Triple Pattern Fragments
Uniform access to history with Memento
A sweet affordable combo for
Linked Data Archives
Time travelling through DBpedia
13. Single archive file (*.hdt)
Header-Dictionary-Triples (HDT) is a
compact binary RDF representation.
[Diagram: Header, Dictionary, and Triples components]
Created by Javier Fernández et al. (he should be in this room…)
14. Features of HDT are desired properties
for digital archives.
High volumes: represent massive data sets as a single file.
Direct access: rapid search for ?subject ?predicate ?object.
Discovery and exchange: included header with dataset metadata.
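A minimal sketch of what direct access looks like in practice, assuming the pyHDT bindings (pip package hdt); the file name is a placeholder:

    from hdt import HDTDocument

    # Open a single-file HDT archive (placeholder file name).
    document = HDTDocument("dbpedia-2015-10.hdt")

    # Search for a triple pattern; empty strings act as wildcards.
    triples, cardinality = document.search_triples(
        "http://dbpedia.org/resource/Amsterdam", "", "")

    print("estimated matches:", cardinality)
    for subject, predicate, obj in triples:
        print(subject, predicate, obj)

The cardinality estimate comes straight from the HDT index, which is what makes counts and paging cheap for a server.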
15. A matrix of HDT files can serve as a
pragmatic RDF archive.
[Matrix diagram: one HDT file per dataset (Dataset A through Dataset Z) per version (t0 back to t-4), with a time-based index over the versions]
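A minimal sketch of such a time-based index (the dataset names, file names, and timestamps below are hypothetical): given a dataset and a target datetime, return the newest snapshot that is not newer than the target.

    import bisect
    from datetime import datetime, timezone

    # Hypothetical index: per dataset, version timestamps sorted ascending.
    INDEX = {
        "dbpedia": [
            (datetime(2014, 9, 1, tzinfo=timezone.utc), "dbpedia-2014.hdt"),
            (datetime(2015, 4, 1, tzinfo=timezone.utc), "dbpedia-2015-04.hdt"),
            (datetime(2015, 10, 1, tzinfo=timezone.utc), "dbpedia-2015-10.hdt"),
        ],
    }

    def snapshot_for(dataset: str, target: datetime) -> str:
        """Return the HDT file of the newest version at or before target."""
        versions = INDEX[dataset]
        position = bisect.bisect_right([stamp for stamp, _ in versions], target)
        if position == 0:
            raise LookupError("no snapshot exists at or before that time")
        return versions[position - 1][1]

    print(snapshot_for("dbpedia", datetime(2015, 6, 15, tzinfo=timezone.utc)))
    # -> dbpedia-2015-04.hdt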
16. 14 DBpedia versions take 12.75%
of the original N-Triples size.
[Bar chart, 0-160 GB: original N-Triples size vs. HDT size for each DBpedia version from 2.0 through 2015-10]
17. Space and time-to-publish significantly
decreased for DBpedia.

                    Original                   HDT-based
    Indexing        Custom                     HDT-CPP
    Indexing time   ~24 hours per version      ~4 hours per version
    Storage         MongoDB                    HDT binary files
    Space           383 GB                     178 GB
    # Versions      10 (2.0 through 3.9)       14 (2.0 through 2015-10)
    # Triples       ~3 billion                 ~6 billion
18. Pragmatic archiving with HDT
Sustainable querying with
Triple Pattern Fragments
Uniform access to history with Memento
A sweet affordable combo for
Linked Data Archives
Time travelling through DBpedia
19. Linked Data Fragments: hunting
trade-offs between client & server.
[Axis of interfaces offered by the server, from data dump via Linked Data pages to SPARQL endpoint:
  data dump: low server cost, high client cost, high availability, high bandwidth, out-of-date data
  SPARQL endpoint: high server cost, low client cost, low availability, low bandwidth, live data]
20. A triple pattern fragments interface
is low-cost and enables clients to query.
[Same axis: triple pattern fragments sit between data dumps / Linked Data pages and SPARQL query results, combining low server cost, high availability, and live data]
21. A Triple Pattern Fragments interface
acts as a gateway to an RDF source.
Clients can only ask for ?s ?p ?o patterns.
Complex SPARQL queries are decomposed
on the client side.
Low server cost, highly cacheable,
but higher bandwidth and query times.
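As an illustration, one page of such a fragment can be fetched with plain HTTP; a real client discovers the query form from the interface's hypermedia controls, so the parameter names below are an assumption modeled on the public DBpedia endpoint:

    import requests

    # Request one page of matches for a single triple pattern.
    response = requests.get(
        "http://fragments.dbpedia.org/2015-10/en",
        params={
            "subject": "http://dbpedia.org/resource/Amsterdam",
            "predicate": "",
            "object": "",
        },
        headers={"Accept": "text/turtle"},
    )
    print(response.status_code)
    print(response.text[:500])  # first triples plus count and paging metadata

Because every request is a simple GET for one pattern, responses cache extremely well, which is where the low server cost comes from.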
25. And still, the API has maintained
99.99% availability to date.
26. Pragmatic archiving with HDT
Sustainable querying with
Triple Pattern Fragments
Uniform access to history with Memento
A sweet affordable combo for
Linked Data Archives
Time travelling through DBpedia
28. Any client can transparently
navigate to a prior version.
30. For archives, interface granularity
and design are even more important.
[Diagram: data dumps and Linked Data pages leave the consumer cost high, with at best partial Memento support; a SPARQL endpoint has a high publisher cost, and Memento support for it is difficult]
31. The Triple Pattern Fragments trade-off
also pays off for archives.
[Same diagram, now with triple pattern fragments alongside data dumps, Linked Data pages, and SPARQL query results: they are directly compatible with Memento, useful for the consumer (queryable), and sustainable for the publisher]
32. Different HDT snapshots are exposed
through an LDF server with Memento
http://fragments.dbpedia.org
33. DBpedia pages can be made available
through a proxy.
http://dbpedia.org/resource/…
34. Preparing the TPF client is simply
adding an HTTP header.
[Client stack: Query Engine (SPARQL processing) over Hypermedia Layer (fragments interaction) over HTTP Layer (resource access).
The HTTP layer sends GET with Accept-Datetime; the server redirects with 303 Location, or answers 200 with Content-Location (for CORS), selecting the matching snapshot (Dataset A or Dataset B)]
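A minimal sketch of that negotiation with Python's requests; the TimeGate URL is illustrative, and the headers follow RFC 7089 (Memento):

    import requests

    # Ask for the resource as it existed at a given moment.
    response = requests.get(
        "http://fragments.dbpedia.org/en",
        headers={"Accept-Datetime": "Wed, 15 Apr 2015 00:00:00 GMT"},
        allow_redirects=True,  # follows a 303 Location to the memento
    )

    # A Memento-aware server reports which snapshot actually answered.
    print(response.url)
    print(response.headers.get("Memento-Datetime"))
    print(response.headers.get("Content-Location"))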
35. A self-descriptive interface results
in a single datetime negotiation.
[Same stack: only the first request carries Accept-Datetime; subsequent GETs receive 200 responses directly, because the fragment's hypermedia controls already point into the chosen snapshot]
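To illustrate the single negotiation (the versioned URL and its page parameter are assumptions modeled on the public endpoint): once the client holds a fragment from the chosen snapshot, it follows that fragment's version-specific links with plain GETs, no Accept-Datetime needed.

    import requests

    # Follow-up request to a link taken from the negotiated fragment.
    page2 = requests.get("http://fragments.dbpedia.org/2015-04/en?page=2")
    print(page2.status_code)  # 200, without any further negotiation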
36. Pragmatic archiving with HDT
Sustainable querying with
Triple Pattern Fragments
Uniform access to history with Memento
A sweet affordable combo for
Linked Data Archives
Time travelling through DBpedia
37. There is a huge amount of interesting
information in the history of
Linked Data.
What could we learn if we could
easily query it?
38. Querying history and the evolution
of facts.
When did a researcher named
Frederik H. Kreuger, born in
Amsterdam, die?
Try it yourself:
bit.ly/frederikkreuger
bit.ly/frederikkreuger-2013
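One possible shape of that query (the actual queries sit behind the bit.ly links; dbo:birthPlace and dbo:deathDate are standard DBpedia ontology terms, but the archived versions may use different vocabulary), written as a string a TPF client would decompose into triple pattern requests:

    # Sketch of the history query; the vocabulary is an assumption.
    QUERY = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>

    SELECT ?deathDate WHERE {
      ?person foaf:name "Frederik H. Kreuger"@en ;
              dbo:birthPlace dbr:Amsterdam ;
              dbo:deathDate ?deathDate .
    }
    """

Running it against the current endpoint and against a 2013 memento shows when the fact appeared.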
39. What predicates were added in DBpedia
between 2009 and 2014 to describe
a person?
Analyze and profile changes
in a dataset.
Try it yourself:
bit.ly/personpredicates-2009
bit.ly/personpredicates-2014
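A sketch of the underlying query (again an assumption about its shape): list the distinct predicates describing persons, run it once against the 2009 memento and once against the 2014 memento, and diff the two result sets.

    # Predicates used to describe persons; diff the results across snapshots.
    QUERY = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    SELECT DISTINCT ?predicate WHERE {
      ?person rdf:type dbo:Person ;
              ?predicate ?object .
    }
    """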
40. What works by cubists were known to
DBpedia and VIAF in 2009?
Resolve out-of-sync issues between
federated sources.
Try it yourself:
bit.ly/workscubists-2009
bit.ly/workscubists
41. Start hosting your own Linked Data
archive (or play with the DBpedia one)!
Software:
github.com/LinkedDataFragments
bit.ly/configuring-memento
Documentation and specification:
www.rdfhdt.org
linkeddatafragments.org
mementoweb.org
Query the DBpedia archive on
fragments.mementodepot.org
42. A sweet affordable combo for
Linked Data Archives
@Miel_vds
Herbert Van de Sompel
Harihar Shankar
Lyudmila Balakireva
Ruben Verborgh