5. Pragmatic archiving with HDT
Sustainable querying with Triple Pattern Fragments
Uniform access to history with Memento
Reproducibility with the 99 cents Linked Data archive
Time travelling through DBpedia
Reproducibility on the Web
11. Publishing Linked Data archives has a sustainability problem.
Many data-publishing institutions are under-resourced.
Many of them care about data history.
They look for “good-enough” solutions.
They commonly resort to data dumps.
They cannot afford public SPARQL infrastructure.
12. Publishing Linked Data archives has a sustainability problem.
Many clients asking complex queries make a server very expensive to scale.
Access to data history makes this problem harder.
Unavailable servers prevent applications from unlocking their potential.
13. Pragmatic archiving with HDT
14. Single archive file (*.hdt)
Header-Dictionary-Triples (HDT) is a compact binary RDF representation.
Components: Header, Dictionary, Triples
Created by Javier Fernández et al.
15. Features of HDT are desirable properties for digital archives.
High volumes: represent massive data sets as a single file
Direct access: rapid search for ?subject ?predicate ?object
Discovery and exchange: included header with dataset metadata
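To make the Header, Dictionary, and Triples roles concrete, here is a minimal sketch in Python. It is a toy in-memory illustration of dictionary encoding and triple-pattern search, not the real HDT binary format or the HDT library API. The class name, method names, and the `ex:` terms are all hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple


class ToyTripleStore:
    """Toy illustration of the three HDT roles; NOT the real HDT format."""

    def __init__(self, metadata: Dict[str, str]) -> None:
        self.header = metadata                      # Header: dataset metadata
        self.term_to_id: Dict[str, int] = {}        # Dictionary: term -> ID
        self.id_to_term: List[str] = []             # Dictionary: ID -> term
        self.triples: Set[Tuple[int, int, int]] = set()  # Triples as ID tuples

    def _encode(self, term: str) -> int:
        # Each distinct term is stored once and referenced by a small integer.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def add(self, s: str, p: str, o: str) -> None:
        self.triples.add((self._encode(s), self._encode(p), self._encode(o)))

    def search(self, s: Optional[str] = None, p: Optional[str] = None,
               o: Optional[str] = None) -> Iterator[Tuple[str, ...]]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        pattern = []
        for term in (s, p, o):
            if term is None:
                pattern.append(None)
            elif term in self.term_to_id:
                pattern.append(self.term_to_id[term])
            else:
                return  # a term missing from the dictionary cannot match
        for triple in sorted(self.triples):
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                yield tuple(self.id_to_term[i] for i in triple)


store = ToyTripleStore({"title": "toy dataset"})
store.add("ex:Alice", "ex:knows", "ex:Bob")
store.add("ex:Alice", "ex:bornIn", "ex:Amsterdam")
store.add("ex:Bob", "ex:bornIn", "ex:Ghent")
born_in = list(store.search(p="ex:bornIn"))
```

The real format relies on compressed, indexed structures rather than a linear scan; the sketch only shows why storing each term once and matching on integer IDs keeps the file compact and searchable.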
16. A matrix of HDT files can serve as a pragmatic RDF archive.
[Figure: time-based index over a matrix of HDT files. Rows are datasets A, B, C, … Z; columns are versions t0, t-1, … t-x; each cell is one HDT file, from HDT At0 to HDT Zt-x.]
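A time-based index over such a matrix can be sketched as a sorted list of snapshots per dataset plus a binary search. The dataset name, file names, and dates below are hypothetical, and the fallback for datetimes earlier than the first snapshot is an assumed policy, not something the slides specify.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical index: dataset -> (snapshot datetime, HDT file), sorted by time.
ARCHIVE = {
    "dataset-a": [
        (datetime(2013, 1, 1, tzinfo=timezone.utc), "dataset-a_t-2.hdt"),
        (datetime(2014, 1, 1, tzinfo=timezone.utc), "dataset-a_t-1.hdt"),
        (datetime(2015, 1, 1, tzinfo=timezone.utc), "dataset-a_t0.hdt"),
    ],
}


def select_snapshot(dataset: str, when: datetime) -> str:
    """Return the HDT file that was current for `dataset` at `when`."""
    versions = ARCHIVE[dataset]
    times = [taken_at for taken_at, _ in versions]
    position = bisect_right(times, when)  # number of snapshots taken <= when
    if position == 0:
        # Assumed policy: before the first snapshot, fall back to the earliest.
        return versions[0][1]
    return versions[position - 1][1]


chosen = select_snapshot("dataset-a", datetime(2014, 6, 1, tzinfo=timezone.utc))
```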
17. 14 DBpedia versions take 12.75% of the original N-Triples size.
[Figure: bar chart comparing original size in N-Triples (GB) with HDT size (GB) for DBpedia versions 2.0, 3.0 through 3.9, 2014, 2015-04, and 2015-10; vertical axis from 0 to 160.]
18. Space and time-to-publish significantly decreased for DBpedia.
                 Original                       HDT-based
Indexing         Custom                         HDT-CPP
Indexing time    ~24 hours per version          ~4 hours per version
Storage          MongoDB                        HDT binary files
Space            383 GB                         178 GB
# Versions       10 versions: 2.0 through 3.9   14 versions: 2.0 through 2015-10
# Triples        ~3 billion                     ~6 billion
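As a back-of-the-envelope reading of these figures, the sketch below derives storage per triple and total indexing time for both setups. It treats the approximate numbers on the slide as exact, assumes decimal gigabytes, and assumes the triple counts are totals across all versions, so the results are indicative only.

```python
GB = 10**9  # assumption: decimal gigabytes

# Figures copied from the comparison above (approximate on the slide).
original = {"space_gb": 383, "versions": 10, "triples": 3e9, "hours_per_version": 24}
hdt_based = {"space_gb": 178, "versions": 14, "triples": 6e9, "hours_per_version": 4}


def summarize(setup: dict) -> dict:
    """Derive bytes per triple and total indexing hours for one setup."""
    return {
        "bytes_per_triple": setup["space_gb"] * GB / setup["triples"],
        "total_indexing_hours": setup["versions"] * setup["hours_per_version"],
    }


for name, setup in (("original", original), ("HDT-based", hdt_based)):
    stats = summarize(setup)
    print(f"{name}: {stats['bytes_per_triple']:.1f} bytes/triple, "
          f"{stats['total_indexing_hours']} indexing hours in total")
```

Under these assumptions the HDT-based setup needs roughly a quarter of the storage per triple and about a quarter of the total indexing time, while covering more versions.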
19. Sustainable querying with Triple Pattern Fragments
20. Linked Data Fragments: hunting trade-offs between client & server.
[Figure: axis of interfaces offered by the server, running from data dump through Linked Data pages to SPARQL endpoint.]
Data dump side: low server cost, high client cost, high availability, high bandwidth, out-of-date data
SPARQL endpoint side: high server cost, low client cost, low availability, low bandwidth, live data
21. A Triple Pattern Fragments interface is low-cost and enables clients to query.
[Figure: axis from data dump through Linked Data pages and triple pattern fragments to SPARQL query results; triple pattern fragments are marked with low server cost, high availability, and live data.]
23. A Triple Pattern Fragments interface acts as a gateway to an RDF source.
Clients can only ask ?s ?p ?o patterns.
Complex SPARQL queries are decomposed on the client side.
Low server cost and highly cacheable, but higher bandwidth and query time.
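The client-side decomposition can be sketched as follows. A stand-in "server" function answers only single triple patterns, and the client solves a multi-pattern query by requesting one pattern at a time and substituting the bindings found so far. The data, function names, and `ex:` terms are hypothetical. A real TPF client would also handle paging and use count metadata to order patterns, which this sketch leaves out.

```python
# Hypothetical toy data standing in for the RDF source behind the interface.
TRIPLES = [
    ("ex:Alice", "rdf:type", "ex:Researcher"),
    ("ex:Alice", "ex:bornIn", "ex:Amsterdam"),
    ("ex:Bob", "rdf:type", "ex:Researcher"),
    ("ex:Bob", "ex:bornIn", "ex:Ghent"),
]


def fragment(s=None, p=None, o=None):
    """Stand-in for the server: answers ONE triple pattern (None = wildcard)."""
    return [t for t in TRIPLES
            if all(want is None or want == got
                   for want, got in zip((s, p, o), t))]


def is_var(term):
    return term.startswith("?")


def evaluate(patterns, binding=None):
    """Client side: solve a list of triple patterns one request at a time."""
    binding = binding or {}
    if not patterns:
        return [binding]
    first, rest = patterns[0], patterns[1:]
    # Bound variables are replaced by their values; unbound ones become wildcards.
    request = tuple(binding.get(t) if is_var(t) else t for t in first)
    solutions = []
    for triple in fragment(*request):
        extended = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            # setdefault binds a new variable or returns its existing value.
            if is_var(term) and extended.setdefault(term, value) != value:
                consistent = False
                break
        if consistent:
            solutions.extend(evaluate(rest, extended))
    return solutions


query = [("?person", "rdf:type", "ex:Researcher"),
         ("?person", "ex:bornIn", "ex:Amsterdam")]
results = evaluate(query)
```

Each call to `fragment` corresponds to one simple, cacheable request, which is where the lower server cost and the higher bandwidth and query time both come from.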
27. And the API still has 99.99% availability to this day.
28. Uniform access to history with Memento
30. Any client can transparently
navigate to a prior version.
32. For archives, interface granularity and design are even more important.
Data dump: no Memento support, high consumer cost
Linked Data pages: Memento support, high consumer cost
SPARQL endpoint: high publisher cost, Memento support difficult
33. The Triple Pattern Fragments trade-off also pays off for archives.
[Figure: axis from data dump through Linked Data pages and triple pattern fragments to SPARQL query results.]
Triple pattern fragments: directly compatible with Memento, useful for the consumer (queryable), sustainable for the publisher
34. Different HDT snapshots are exposed
through an LDF server with Memento
http://fragments.dbpedia.org
35. DBpedia pages can be made available
through a proxy.
http://dbpedia.org/resource/…
36. Preparing the TPF client simply requires adding an HTTP header.
Client: Query Engine (SPARQL processing), Hypermedia Layer (fragments interaction), HTTP Layer (resource access)
Server: Dataset A, Dataset B
Request: GET with Accept-Datetime
Responses: 303 with Location, 200 with Content-Location (CORS)
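Adding the header can be sketched with the Python standard library alone. The helper name is hypothetical, the target URL reuses the server address shown earlier as a placeholder, and the request is only built, not sent. `Accept-Datetime` takes an HTTP date in GMT, which `email.utils.format_datetime` produces for a UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def memento_request(url: str, when: datetime) -> Request:
    """Build (but do not send) a GET request asking for the state at `when`.

    `when` must be timezone-aware; it is converted to UTC for the header.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"Accept-Datetime": http_date}, method="GET")


# Placeholder URL: the exact fragment address to negotiate on is an assumption.
request = memento_request("http://fragments.dbpedia.org",
                          datetime(2014, 1, 1, tzinfo=timezone.utc))
```

Sending the request with `urllib.request.urlopen(request)` would follow the redirect to the selected version automatically, provided the server is reachable.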
37. A self-descriptive interface results in a single datetime negotiation.
Client: Query Engine (SPARQL processing), Hypermedia Layer (fragments interaction), HTTP Layer (resource access)
Server: Dataset A, Dataset B
Request: GET
Response: 200
38. Time travelling through DBpedia
39. There is interesting information in the
history of
Linked Data / DBpedia.
What could we learn if we could
easily query it?
40. Querying history and the evolution
of facts.
When did a researcher named Frederik H. Kreuger, born in Amsterdam, die?
Try it yourself:
bit.ly/frederikkreuger
bit.ly/frederikkreuger-2013
41. What predicates were added in DBpedia
between 2009 and 2014 to describe
a person?
Analyze and profile changes in a dataset.
Try it yourself:
bit.ly/personpredicates-2009
bit.ly/personpredicates-2014
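The idea behind this comparison can be sketched as a set difference between the predicates used for one class in two snapshots. The snapshots below are hypothetical toy data, not actual DBpedia content, and the function name is illustrative; the linked queries run against the real archive instead.

```python
# Hypothetical toy snapshots; the real comparison runs against DBpedia versions.
SNAPSHOT_2009 = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
]
SNAPSHOT_2014 = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:birthPlace", "ex:Amsterdam"),
    ("ex:paper1", "rdf:type", "ex:Work"),
    ("ex:paper1", "ex:title", "A paper"),
]


def predicates_for_class(triples, rdf_class):
    """Collect every predicate used on subjects typed as `rdf_class`."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == rdf_class}
    return {p for s, p, o in triples if s in instances}


added = (predicates_for_class(SNAPSHOT_2014, "ex:Person")
         - predicates_for_class(SNAPSHOT_2009, "ex:Person"))
```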
42. What works by cubists were known to DBpedia and VIAF in 2009?
Resolve out-of-sync issues between
federated sources.
Try it yourself:
bit.ly/workscubists-2009
bit.ly/workscubists
43. Start hosting your own Linked Data
archive (or play with the DBpedia one)!
Software, documentation, and specifications:
github.com/LinkedDataFragments
bit.ly/configuring-memento
www.rdfhdt.org
linkeddatafragments.org
mementoweb.org
Query the DBpedia archive on fragments.mementodepot.org
44. Reproducibility with
the 99 cents Linked Data archive
@Miel_vds
Herbert Van de Sompel
Harihar Shankar
Lyudmila Balakireva
Ruben Verborgh