
Reproducibility with 
the 99 cents Linked Data archive


Publish 'good enough' data to achieve reproducibility



  1. Reproducibility with the 99 cents Linked Data archive (Miel Vander Sande)

  2. Reproducibility.
  3. Overview: Reproducibility on the Web · Pragmatic archiving with HDT · Sustainable querying with Triple Pattern Fragments · Uniform access to history with Memento · Time travelling through DBpedia.
  4. Section: Reproducibility on the Web.
  5. Reproducing experiments to sustain validity.
  6. Reproducing experiments to sustain validity.
  7. Backwards-compatible Linked Open Data applications (1.0 → 2.0).
  8. Publishing Linked Data archives has a sustainability problem. Many data-publishing institutions are under-resourced, yet many of them care about data history. Unable to afford public SPARQL infrastructure, they look for "good enough" solutions and commonly resort to data dumps.
  9. Publishing Linked Data archives has a sustainability problem. Many clients asking complex queries is very expensive for a server to scale, and access to data history makes this problem harder. Unavailable servers prevent applications from unlocking the data's potential.
  10. Section: Pragmatic archiving with HDT.
  11. Header-Dictionary-Triples (HDT) is a compact binary RDF representation: a single archive file (*.hdt) containing a Header, a Dictionary, and Triples. Created by Javier Fernández et al.
  12. Features of HDT are desirable properties for digital archives: high volumes (represent massive data sets as a single file), direct access (rapid search for ?subject ?predicate ?object), and discovery and exchange (an included header with dataset metadata).
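The Dictionary + Triples split can be illustrated with a small sketch. This is plain Python, not the actual HDT binary format, and the terms and helper names are made up for illustration: every RDF term gets an integer ID, so each triple becomes a compact integer tuple that compresses well.

```python
def build_dictionary(triples):
    """Assign an integer ID to every distinct RDF term (hypothetical helper)."""
    terms = sorted({term for triple in triples for term in triple})
    return {term: i for i, term in enumerate(terms, start=1)}

def encode(triples, dictionary):
    """Replace terms by their IDs; the ID tuples are what gets stored compactly."""
    return [tuple(dictionary[term] for term in triple) for triple in triples]

# Illustrative triples, not real archive content.
triples = [
    ("dbr:Ghent", "rdf:type", "dbo:City"),
    ("dbr:Ghent", "dbo:country", "dbr:Belgium"),
]
dictionary = build_dictionary(triples)
encoded = encode(triples, dictionary)
print(encoded)  # → [(4, 5, 1), (4, 2, 3)]
```

Real HDT additionally bit-packs these ID sequences and keeps the dictionary itself compressed, which is where the large space savings come from.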
  13. A matrix of HDT files can serve as a pragmatic RDF archive: a time-based index that stores, for each dataset (A, B, C, …, Z), one HDT snapshot per version (t0, t-1, …, t-x).
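The time-based index over such a matrix can be sketched as follows. File names, dates, and the `snapshot_at` helper are illustrative, not the real DBpedia archive: per dataset, versions are kept sorted by date, and "the dataset as of t" is answered by the latest snapshot at or before t.

```python
import bisect
from datetime import date

# One row of the matrix: snapshots of a dataset, sorted by version date.
# Dates and file names are made up for illustration.
snapshots = {
    "dbpedia": [
        (date(2010, 1, 1), "dbpedia-3.4.hdt"),
        (date(2012, 6, 1), "dbpedia-3.8.hdt"),
        (date(2015, 10, 1), "dbpedia-2015-10.hdt"),
    ],
}

def snapshot_at(dataset, when):
    """Return the HDT file holding the latest snapshot at or before `when`."""
    versions = snapshots[dataset]
    dates = [d for d, _ in versions]
    i = bisect.bisect_right(dates, when) - 1
    if i < 0:
        raise LookupError(f"no snapshot of {dataset} at or before {when}")
    return versions[i][1]

print(snapshot_at("dbpedia", date(2013, 1, 1)))  # → dbpedia-3.8.hdt
```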
  14. 14 DBpedia versions take 12.75% of the original N-Triples size. [Chart: original size in NT (GB) vs. HDT size (GB) for versions 2.0, 3.0 through 3.9, 2014, 2015-04, and 2015-10.]
  15. Space and time-to-publish significantly decreased for DBpedia.
                      Original               HDT-based
      Indexing        Custom                 HDT-CPP
      Indexing time   ~24 hours per version  ~4 hours per version
      Storage         MongoDB                HDT binary files
      Space           383 GB                 178 GB
      # Versions      10 (2.0 through 3.9)   14 (2.0 through 2015-10)
      # Triples       ~3 billion             ~6 billion
  16. Section: Sustainable querying with Triple Pattern Fragments.
  17. Linked Data Fragments: hunting trade-offs between client and server. [Axis of interfaces offered by the server, from data dump (low server cost, high client cost, high availability, high bandwidth, out-of-date data) via Linked Data pages to SPARQL endpoint (high server cost, low client cost, low availability, low bandwidth, live data).]
  18. A triple pattern fragments interface is low-cost and enables clients to query: on the same axis, it sits between data dumps, Linked Data pages, and SPARQL query results, combining low server cost with high availability and live data.
  19. Expect less from servers, so you can publish more.
  20. A Triple Pattern Fragments interface acts as a gateway to an RDF source. Clients can only ask ?s ?p ?o patterns; complex SPARQL queries are decomposed on the client side. The result is low server cost and high cacheability, at the price of higher bandwidth and query time.
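How a client evaluates a join when the server only answers single triple patterns can be sketched like this. The in-memory `fragment` function stands in for the HTTP interface, and the data and query are illustrative:

```python
# Toy RDF source behind the interface (illustrative triples).
DATA = {
    ("dbr:Alice", "dbo:birthPlace", "dbr:Amsterdam"),
    ("dbr:Alice", "rdf:type", "dbo:Person"),
    ("dbr:Bob", "dbo:birthPlace", "dbr:Ghent"),
    ("dbr:Bob", "rdf:type", "dbo:Person"),
}

def fragment(s=None, p=None, o=None):
    """Server side: answer a single ?s ?p ?o pattern, nothing more."""
    return [t for t in DATA
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def people_born_in(place):
    """Client side: decompose a two-pattern query into fragment requests."""
    # First pattern request binds the subjects.
    candidates = {s for s, _, _ in fragment(p="dbo:birthPlace", o=place)}
    # One more pattern request per candidate joins on the subject.
    return sorted(s for s in candidates
                  if fragment(s=s, p="rdf:type", o="dbo:Person"))

print(people_born_in("dbr:Amsterdam"))  # → ['dbr:Alice']
```

Each `fragment` call is cheap and cacheable for the server; the join work, and the extra round trips it costs, stay with the client.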
  21. Usage of fragments.dbpedia.org is steadily increasing: from 4,500,000 requests in February 2015 to 19,239,907 in September 2016.
  22. And still, the API has had 99.99% availability to date.
  23. Section: Uniform access to history with Memento.
  24. The Memento framework lets you negotiate Web resources over time.
  25. Any client can transparently navigate to a prior version.
  26. Any client can transparently navigate to a prior version.
  27. For archives, interface granularity and design are even more important: a data dump has no Memento support and high consumer cost; Linked Data pages support Memento but still have high consumer cost; a SPARQL endpoint has high publisher cost, and Memento support is difficult.
  28. The Triple Pattern Fragments trade-off also pays off for archives: triple pattern fragments are directly compatible with Memento, useful for the consumer (queryable), and sustainable for the publisher. [Axis: data dump · Linked Data pages · triple pattern fragments · SPARQL query results.]
  29. Different HDT snapshots are exposed through an LDF server with Memento: http://fragments.dbpedia.org
  30. DBpedia pages can be made available through a proxy: http://dbpedia.org/resource/…
  31. Preparing the TPF client is simply adding an HTTP header. The client stack (Query Engine for SPARQL processing, Hypermedia Layer for fragments interaction, HTTP Layer for resource access) sends GET with Accept-Datetime; the server redirects with 303 Location and then answers 200 with Content-Location (CORS), serving the matching dataset snapshot.
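Adding that header can be sketched with Python's standard library. The fragment URL path is illustrative, and the request is only constructed here, not sent:

```python
from urllib.request import Request

# The one change a TPF client needs for time travel: a Memento
# Accept-Datetime header (RFC 7089) on its fragment requests.
# The path below is an illustrative fragments URL, not necessarily
# the real endpoint layout.
req = Request(
    "http://fragments.dbpedia.org/2015/en",
    headers={"Accept-Datetime": "Thu, 01 Oct 2015 00:00:00 GMT"},
)
# A Memento-aware server (acting as a TimeGate) would answer with the
# snapshot closest to that datetime, reporting the chosen version in
# its Memento-Datetime / Content-Location response headers.
print(req.get_header("Accept-datetime"))  # → Thu, 01 Oct 2015 00:00:00 GMT
```

Note that `urllib` normalizes header names, so they are read back with `get_header("Accept-datetime")`.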
  32. A self-descriptive interface results in a single datetime negotiation: the client (Query Engine, Hypermedia Layer, HTTP Layer) sends one GET and the server answers 200 directly with the right dataset snapshot.
  33. Section: Time travelling through DBpedia.
  34. There is interesting information in the history of Linked Data / DBpedia. What could we learn if we could easily query it?
  35. Querying history and the evolution of facts: when did a researcher named Frederik H. Kreuger, born in Amsterdam, die? Try it yourself: bit.ly/frederikkreuger · bit.ly/frederikkreuger-2013
  36. Analyze and profile changes in a dataset: what predicates were added in DBpedia between 2009 and 2014 to describe a person? Try it yourself: bit.ly/personpredicates-2009 · bit.ly/personpredicates-2014
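With each version of the archive available as a set of triples, the "predicates added" question on this slide reduces to a set difference, as this sketch shows (the triples are illustrative, not real DBpedia data):

```python
# Two illustrative snapshots of statements about one resource.
V2009 = {
    ("dbr:Frederik_Kreuger", "foaf:name", '"Frederik H. Kreuger"'),
    ("dbr:Frederik_Kreuger", "dbo:birthPlace", "dbr:Amsterdam"),
}
V2014 = V2009 | {
    ("dbr:Frederik_Kreuger", "dbo:field", "dbr:Electrical_engineering"),
}

def added_predicates(old, new):
    """Predicates present in the newer snapshot but not in the older one."""
    return {p for _, p, _ in new} - {p for _, p, _ in old}

print(sorted(added_predicates(V2009, V2014)))  # → ['dbo:field']
```

In practice each snapshot would come from the archive's triple pattern fragments rather than an in-memory set, but the profiling step is the same.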
  37. Resolve out-of-sync issues between federated sources: which works by cubists were known to DBpedia and VIAF in 2009? Try it yourself: bit.ly/workscubists-2009 · bit.ly/workscubists
  38. Start hosting your own Linked Data archive (or play with the DBpedia one)! Software, documentation, and specifications: github.com/LinkedDataFragments · www.rdfhdt.org · linkeddatafragments.org · mementoweb.org · bit.ly/configuring-memento. Query the DBpedia archive on fragments.mementodepot.org
  39. Reproducibility with the 99 cents Linked Data archive. @Miel_vds, with Herbert Van de Sompel, Harihar Shankar, Lyudmila Balakireva, and Ruben Verborgh.
