
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT


DBpedia is the Linked Data version of Wikipedia. Starting in 2007, several DBpedia dumps have been made available for download. In 2010, the Research Library at the Los Alamos National Laboratory used these dumps to deploy a Memento-compliant DBpedia Archive, in order to demonstrate the applicability and appeal of accessing temporal versions of Linked Data sets using the Memento “Time Travel for the Web” protocol. The archive supported datetime negotiation to access various temporal versions of RDF descriptions of DBpedia subject URIs.

In a recent collaboration with the iMinds Group of Ghent University, the DBpedia Archive received a major overhaul. The initial MongoDB storage approach, which was unable to handle increasingly large DBpedia dumps, was replaced by HDT, the Binary RDF Representation for Publication and Exchange. And, in addition to the existing subject URI access point, Triple Pattern Fragments access, as proposed by the Linked Data Fragments project, was added. This allows datetime negotiation for URIs that identify RDF triples that match subject/predicate/object patterns. To add this powerful capability, native Memento support was added to the Linked Data Fragments Server of Ghent University.

In this talk, we will include a brief refresher of Memento, and will cover Linked Data Fragments, Triple Pattern Fragments, and HDT in more detail. We will share lessons learned from this effort and demo the new DBpedia Archive, which, at this point, holds over 5 billion RDF triples.


  1. Access to DBpedia Versions using Memento and Triple Pattern Fragments. Herbert Van de Sompel (@hvdsomp), Los Alamos National Laboratory; Miel Vander Sande (@Miel_vds), Ghent University. Acknowledgments: Lyudmila Balakireva, Harihar Shankar, Ruben Verborgh. CNI Spring Meeting, San Antonio, TX, April 5 2016.
  2. Outline
     • Prelude: Memento and Linked Data
     • First Generation DBpedia Archive
     • Devising Affordable/Useful Linked Data Archives
     • Intermezzo: Triple Pattern Fragments (TPF)
     • Intermezzo: Binary RDF Representation (HDT)
     • Devising Affordable/Useful Linked Data Archives
     • Second Generation DBpedia Archive
     • Try this At Home
  3. Outline (repeated from slide 2)
  4. Memento Framework
  5. Memento LDOW 2010 Submission: Herbert Van de Sompel et al. (2010) An HTTP-Based Versioning Mechanism for Linked Data. http://arxiv.org/abs/1003.3661
  6. Memento and Linked Data
  7. Memento and Linked Data
  8. Time-Series Analysis across DBpedia Versions: data collected through “follow your nose” HTTP navigation
  9. Outline (repeated from slide 2)
  10. First Generation DBpedia Archive: Storage
  11. First Generation DBpedia Archive: Storage Characteristics
      • upload software: custom
      • upload time: ~24 hours per version
      • storage software: MongoDB
      • storage space: 383 GB for 10 versions
      • DBpedia versions: 10 versions, 2.0 through 3.9
      • number of triples: ~3 billion
  12. First Generation DBpedia Archive: Subject-URI Access
  13. First Generation DBpedia Archive: Subject-URI Access http://dbpedia.mementodepot.org/memento/2009052/http://dbpedia.org/page/Oaxaca
  14. First Generation DBpedia Archive: Subject-URI Access Characteristics
      • TimeGate software: custom
      • access type: Subject URI & datetime
      • external integration: current DBpedia
      • clients:
        • all clients: direct access to Memento Subject-URI
        • Memento clients: datetime negotiation with Subject-URI
  15. DBpedia Archive @ LANL Since 2010
      • Access based on Subject-URI (DBpedia Topic URI) only
      • MongoDB storage
      • A blob per Subject-URI per version
      • Dynamically transformed to other RDF serializations
      • No updates since version 3.9 (2013) of DBpedia as a result of scalability problems
  16. Outline (repeated from slide 2)
  17. Affordable & Useful Linked Data Archives
      • A Linked Data Archive consists of temporal snapshots of one or more Linked Data sets, whereby each temporal snapshot reflects the state of a Linked Data set at a specific moment or interval in time.
      • How to make Linked Data Archives accessible in a manner that is
        • affordable/sustainable for the publisher
        • useful for the consumer
  18. Linked Data Archive: Characteristics
      General characteristics (rated for publisher and consumer):
      • Availability
      • Bandwidth
      • Cost
      Functionality:
      • Interface expressiveness
      • LOD integration
      • Memento support
      • Cross time/data
      Verdict:
      • Publication perspective: $$$$
      • Access perspective: ++++
  19. Linked Data Publishing
      • The typical ways of publishing Linked Data on the Web:
        • Subject URI access
        • Data dump
        • SPARQL endpoint
      Let’s consider these from the perspective of Linked Data Archives, i.e. archival storage and access.
  20. Linked Data Archive with Subject-URI Access
      • For each temporal snapshot of a Linked Data set, and for each Subject in that snapshot, publish an RDF description (of the Subject) at a URI that is specific per snapshot/subject
  21. Linked Data Archive with Subject-URI Access: Characteristics
      General characteristics (publisher / consumer):
      • Availability: rather high / rather high
      • Bandwidth: ~ description / ~ description
      • Cost: rather low / rather high
      Functionality:
      • Interface expressiveness: rather low
      • LOD integration: yes
      • Memento support: possible
      • Cross time/data: follow your nose
      Verdict:
      • Publication perspective: $$$$
      • Access perspective: ++++
  22. Linked Data Archive Using Dumps
      • Renders each temporal snapshot of a Linked Data set as a data dump that places all temporal dataset triples (as they were at a specific moment in time) into one or more files
  23. Linked Data Archive Using Dumps: Characteristics
      General characteristics (publisher / consumer):
      • Availability: high / high
      • Bandwidth: high / high
      • Cost: low / high
      Functionality:
      • Interface expressiveness: download dataset
      • LOD integration: no
      • Memento support: not possible
      • Cross time/data: download various datasets
      Verdict:
      • Publication perspective: $$$$
      • Access perspective: ++++
  24. Linked Data Archive with SPARQL Endpoint(s)
      • For each temporal snapshot of a Linked Data set, supports arbitrary SPARQL queries.
      • Different architectural set-ups possible; no standard approach
  25. Linked Data Archive Using SPARQL Endpoint(s): Characteristics
      General characteristics (publisher / consumer):
      • Availability: problematic / problematic
      • Bandwidth: ~ query / ~ query
      • Cost: high / low
      Functionality:
      • Interface expressiveness: highly expressive
      • LOD integration: no
      • Memento support: hard
      • Cross time/data: custom distributed queries
      Verdict:
      • Publication perspective: $$$$
      • Access perspective: ++++
  26. Affordable & Useful Linked Data Archives
      Linked Data Archive type (publishing / consuming):
      • Data Dump: $$$$ / ++++
      • SPARQL Endpoint(s): $$$$ / ++++
      • Subject URI Access: $$$$ / ++++
  27. Outline (repeated from slide 2)
  28. Linked Data Fragments (Ghent U)
      • Every Linked Data interface offers specific fragments of a Linked Data set
      • A fragment is described by
        • Selector: what questions can I ask?
        • Controls: how do I get more fragments?
        • Metadata: helpful information for consumption?
      • Each interface type comes with tradeoffs
        • cf. the analysis thus far
      http://linkeddatafragments.org
      Verborgh, R. et al. (2014) Querying datasets on the web with high availability. ISWC 2014. http://ruben.verborgh.org/publications/verborgh_iswc_2014/
  29. Triple Pattern Fragments (Ghent U)
      • Triple Pattern Fragments is a new interface with a different set of tradeoffs that are attractive from an archival perspective
      http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/
  30. Triple Pattern Fragments (Ghent U)
      • Allows querying a Linked Data set according to ?Subject ?Predicate ?Object patterns
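A minimal sketch of what a ?subject ?predicate ?object pattern selects, using a small in-memory list of triples. The sample triples, the `ex:` prefix, and the function name `match_pattern` are illustrative assumptions and are not taken from the TPF specification or from the Ghent server; a component left as `None` stands for an unbound variable.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]


def match_pattern(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple whose bound components equal the given values."""
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


# Made-up sample data, for illustration only.
SAMPLE = [
    ("ex:Alice", "ex:birthPlace", "ex:Ghent"),
    ("ex:Bob", "ex:birthPlace", "ex:Ghent"),
    ("ex:Alice", "ex:knows", "ex:Bob"),
]

# Pattern "?s ex:birthPlace ex:Ghent": subject unbound, predicate and object bound.
born_in_ghent = list(match_pattern(SAMPLE, predicate="ex:birthPlace", obj="ex:Ghent"))
```

A real TPF server answers the same kind of question over HTTP and returns the matches in pages, together with controls and metadata, rather than as a Python list.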
  31. Triple Pattern Fragments (Ghent U)
      • Controls: responses provide navigational help for clients
        • Based on emerging Hydra vocabulary for self-describing Hypermedia-Driven Web APIs
      • Metadata: dataset info, estimated count (to aid client applications)
      http://www.hydra-cg.com/spec/latest/core/
  32. Outline (repeated from slide 2)
  33. Binary RDF Representation for Publication and Exchange (HDT) http://www.w3.org/Submission/HDT/
  34. Binary RDF Representation for Publication and Exchange (HDT) http://www.w3.org/Submission/HDT/
      • Header-Dictionary-Triple (HDT) is a compact, binary representation of RDF datasets.
  35. Binary RDF Representation for Publication and Exchange (HDT) http://www.w3.org/Submission/HDT/
      • Able to represent massive data sets
      • Dictionary/Triples structure achieves
        • rapid search for ?subject ?predicate ?object pattern
        • high compression rates
      • Header provides metadata about the dataset
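The dictionary/triples idea can be illustrated with a toy structure: each distinct term is stored once and given an integer ID, and triples are kept as sorted ID tuples so that a lookup by subject becomes a binary search. This is only a simplified sketch under those assumptions; it is not the HDT binary format, which is considerably more elaborate, and the class name `TinyDictTriples` and the sample data are hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


class TinyDictTriples:
    """Toy dictionary-encoded triple store (illustration only, not HDT)."""

    def __init__(self, triples: Iterable[Triple]) -> None:
        triples = list(triples)
        # Dictionary: every distinct term is stored exactly once.
        self.term_of: List[str] = sorted({t for triple in triples for t in triple})
        self.id_of = {term: i for i, term in enumerate(self.term_of)}
        # Triples: sorted tuples of integer IDs instead of repeated strings.
        self.ids = sorted(
            (self.id_of[s], self.id_of[p], self.id_of[o]) for s, p, o in triples
        )

    def by_subject(self, subject: str) -> List[Triple]:
        """Return all triples with the given subject via binary search."""
        sid = self.id_of.get(subject)
        if sid is None:
            return []
        start = bisect_left(self.ids, (sid, -1, -1))
        found = []
        for s, p, o in self.ids[start:]:
            if s != sid:
                break
            found.append((self.term_of[s], self.term_of[p], self.term_of[o]))
        return found


# Made-up sample data, for illustration only.
SAMPLE = [
    ("ex:Alice", "ex:birthPlace", "ex:Ghent"),
    ("ex:Bob", "ex:birthPlace", "ex:Ghent"),
    ("ex:Alice", "ex:knows", "ex:Bob"),
]
store = TinyDictTriples(SAMPLE)
alice = store.by_subject("ex:Alice")
```

The three sample triples mention nine terms but only five distinct ones, which hints at why storing each term once helps with compression on large data sets.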
  36. Outline (repeated from slide 2)
  37. HDT Linked Data Archive with TPF Support
      • For each temporal snapshot of a Linked Data set, generate an HDT serialization that provides access according to ?subject ?predicate ?object patterns
  38. Linked Data Archive with ?s?p?o Access: Characteristics
      General characteristics (publisher / consumer):
      • Availability: high / high
      • Bandwidth: ~ query / ~ query
      • Cost: low / medium
      Functionality:
      • Interface expressiveness: better than subject-URI only
      • LOD integration: yes
      • Memento support: possible
      • Cross time/data: follow your nose
      Verdict:
      • Publication perspective: $$$$
      • Access perspective: ++++
  39. Affordable & Useful Linked Data Archives
      Linked Data Archive type (publishing / consuming):
      • Data Dump: $$$$ / ++++
      • SPARQL Endpoint(s): $$$$ / ++++
      • Subject URI Access: $$$$ / ++++
      • HDT & TPF: $$$$ / ++++
  40. Outline (repeated from slide 2)
  41. Second Generation DBpedia Archive: Storage
  42. Second Generation DBpedia Archive: Storage Characteristics
      • upload software: HDT-CPP
      • upload time: ~4 hours per version
      • storage software: HDT binary files
      • storage space: 70 GB for 12 versions
      • DBpedia versions: 12 versions, 2.0 through 2015
      • number of triples: ~5 billion
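A rough back-of-the-envelope comparison of the two storage approaches, using only the approximate figures reported on the first-generation and second-generation storage slides. It assumes that the storage unit on the slides means gigabytes of 10^9 bytes and treats the rounded triple counts as exact, so the results are indicative at best.

```python
# Figures as reported on the storage slides (all approximate).
first_gen = {"storage_gb": 383, "versions": 10, "triples": 3e9, "upload_hours": 24}
second_gen = {"storage_gb": 70, "versions": 12, "triples": 5e9, "upload_hours": 4}


def gb_per_version(archive: dict) -> float:
    """Average storage per DBpedia version, in GB."""
    return archive["storage_gb"] / archive["versions"]


def bytes_per_triple(archive: dict) -> float:
    """Average storage per triple, assuming 1 GB = 10**9 bytes."""
    return archive["storage_gb"] * 1e9 / archive["triples"]


for name, archive in (("MongoDB", first_gen), ("HDT", second_gen)):
    print(
        f"{name}: {gb_per_version(archive):.1f} GB per version, "
        f"{bytes_per_triple(archive):.0f} bytes per triple, "
        f"{archive['upload_hours']} h upload per version"
    )
```

Under these assumptions the averages are about 38 GB versus about 6 GB per version, and roughly 128 versus 14 bytes per triple.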
  43. Second Generation DBpedia Archive: ?s?p?o Query-URI Access
  44. Second Generation DBpedia Archive: ?s?p?o Query-URI Access http://fragments.mementodepot.org/dbpedia_3_8?subject=&predicate=http://dbpedia.org/ontology/birthPlace&object=http://dbpedia.org/resource/Ghent
  45. Second Generation DBpedia Archive: ?s?p?o Query-URI Access
      • TimeGate URI: http://fragments.mementodepot.org/timegate/dbpedia?subject={DBpediaURI}&predicate={DBpediaURI}&object={DBpediaURI}
        Example: http://fragments.mementodepot.org/timegate/dbpedia?subject=&predicate=&object=http://dbpedia.org/resource/Ghent
      • TimeMap URI: not supported
      • Memento URI: http://fragments.mementodepot.org/{DBpediaVersion}?subject={DBpediaURI}&predicate={DBpediaURI}&object={DBpediaURI}
        Example: http://fragments.mementodepot.org/dbpedia_3_0?subject=&predicate=&object=http://dbpedia.org/resource/Ghent
      • Further info: http://mementoweb.org/depot/native/fragments/
      • Try it with Memento for Chrome: http://bit.ly/memento-for-chrome
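The URI templates above can be filled in mechanically. The sketch below builds Memento and TimeGate query URIs with the standard library; the helper names are hypothetical, and keeping `:` and `/` unescaped in the query string is a choice made here only to mirror the form shown on the slide (percent-encoding them would also be valid). Whether the service still answers at these addresses is not something this sketch checks.

```python
from urllib.parse import urlencode

BASE = "http://fragments.mementodepot.org"


def fragment_uri(dataset: str, subject: str = "", predicate: str = "", obj: str = "") -> str:
    """Build a ?s?p?o query URI for one dataset; empty string means unbound."""
    query = urlencode(
        {"subject": subject, "predicate": predicate, "object": obj},
        safe=":/",  # leave URI punctuation readable, as on the slide
    )
    return f"{BASE}/{dataset}?{query}"


def timegate_query_uri(subject: str = "", predicate: str = "", obj: str = "") -> str:
    """Build the TimeGate URI used for datetime negotiation over a pattern."""
    return fragment_uri("timegate/dbpedia", subject, predicate, obj)


memento = fragment_uri("dbpedia_3_0", obj="http://dbpedia.org/resource/Ghent")
timegate = timegate_query_uri(obj="http://dbpedia.org/resource/Ghent")
print(memento)
print(timegate)
```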
  46. Second Generation DBpedia Archive: Subject-URI Access
  47. Second Generation DBpedia Archive: Subject-URI Access
      • TimeGate URI: http://dbpedia.mementodepot.org/timegate/{DBpediaURI}
        Example: http://dbpedia.mementodepot.org/timegate/http://dbpedia.org/data/Ghent
      • TimeMap URI: http://dbpedia.mementodepot.org/timemap/link/{DBpediaURI}
        Example: http://dbpedia.mementodepot.org/timemap/link/http://dbpedia.org/data/Ghent
      • Memento URI: http://dbpedia.mementodepot.org/{yyyymmdd}/{DBpediaURI}
        Example: http://dbpedia.mementodepot.org/20080103/http://dbpedia.org/data/Ghent
      • Further info: http://mementoweb.org/depot/native/dbpedia/
      • Try it with Memento for Chrome: http://bit.ly/memento-for-chrome
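As a sketch of how a client might prepare a datetime negotiation request against the Subject-URI access point: the three URI templates come from the slide, and the `Accept-Datetime` request header is the one defined by the Memento protocol (RFC 7089). The helper names are hypothetical, and the code only builds URIs and headers; it does not contact the archive.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

ARCHIVE = "http://dbpedia.mementodepot.org"


def timegate_uri(resource: str) -> str:
    """TimeGate URI for a DBpedia resource URI."""
    return f"{ARCHIVE}/timegate/{resource}"


def timemap_uri(resource: str) -> str:
    """TimeMap URI (link format) for a DBpedia resource URI."""
    return f"{ARCHIVE}/timemap/link/{resource}"


def memento_uri(resource: str, when: datetime) -> str:
    """Memento URI using the {yyyymmdd} template from the slide."""
    return f"{ARCHIVE}/{when:%Y%m%d}/{resource}"


def accept_datetime_header(when: datetime) -> dict:
    """Request header asking a TimeGate for the version around `when`."""
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


resource = "http://dbpedia.org/data/Ghent"
when = datetime(2008, 1, 3, tzinfo=timezone.utc)
print(timegate_uri(resource), accept_datetime_header(when))
print(memento_uri(resource, when))
```

A Memento client would send the header with a GET or HEAD request to the TimeGate URI and follow the redirect to the selected Memento.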
  48. Second Generation DBpedia Archive: Access Characteristics
      • TimeGate software: ① node.js LDF server 2.0.0; ② LDF js client
      • access type: ① ?s?p?o Query-URI & datetime; ② Subject-URI & datetime
      • external integration: ① DBpedia LDF server; ② current DBpedia
      • clients:
        • all clients: direct access to Mementos of Subject-URI and ?s?p?o Query-URI
        • Memento clients: datetime negotiation with Subject-URI and ?s?p?o Query-URI
  49. Outline (repeated from slide 2)
  50. Building a Linked Data Archive
      • Convert the archival data set(s) to HDT using HDT-CPP
  51. HDT Software (C++) https://github.com/rdfhdt/hdt-cpp
      • input data requires cleaning before processing, especially regarding URI characters
        • DBpedia data not clean
        • DBpedia v3.5 was not successfully processed
        • No meaningful error messages to help locate problems
      • memory intensive
        • Kyoto Cabinet was used to optimize storage requirement and speed during processing
      • Java version exists but has memory problems
  52. Building a Linked Data Archive
      • Convert the archival data set(s) to HDT using HDT-CPP
      • Download the Triple Fragment Server code
  53. Linked Data Fragment Server (Node.js) https://github.com/LinkedDataFragments/Server.js
      • provides ?s?p?o access to local and/or remote Linked Data sets
      • supports HDT, Turtle files, N-Triples files, JSON-LD files, SPARQL endpoints, in-memory store, and BlazeGraph Linked Data sets
      • version 2.0.0 (released March 31 2016) has built-in Memento support
  54. Building a Linked Data Archive
      • Convert the archival data set(s) to HDT using HDT-CPP
      • Download the Triple Fragment Server code
      • Create the JSON config file for Memento
  55. Linked Data Fragment Server, Memento Configuration https://github.com/LinkedDataFragments/Server.js/wiki/Configuring-Memento
      • declare archival data set(s)
      • add datetime ranges for the archival data set(s)
      • add a TimeGate
      • list the archival data set(s) for which the TimeGate should support datetime negotiation
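To make the role of the datetime ranges concrete, here is a minimal sketch of how a TimeGate could pick a data set for a requested datetime: return the snapshot whose range contains it, otherwise the snapshot with the nearest range boundary. This selection rule is an assumption made for illustration and is not taken from the LDF server code; the snapshot names and dates are invented and do not correspond to actual DBpedia releases.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    name: str
    start: datetime  # first moment this snapshot is considered valid
    end: datetime    # last moment this snapshot is considered valid


def negotiate(snapshots: List[Snapshot], requested: datetime) -> Snapshot:
    """Pick the snapshot covering `requested`, else the nearest one."""
    if not snapshots:
        raise ValueError("no snapshots configured")
    for snap in snapshots:
        if snap.start <= requested <= snap.end:
            return snap
    # Outside every range: fall back to the closest range boundary.
    return min(
        snapshots,
        key=lambda s: min(abs(requested - s.start), abs(requested - s.end)),
    )


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical snapshots with invented validity ranges.
SNAPSHOTS = [
    Snapshot("snapshot_a", utc(2010, 1, 1), utc(2010, 12, 31)),
    Snapshot("snapshot_b", utc(2011, 1, 1), utc(2012, 6, 1)),
]

chosen = negotiate(SNAPSHOTS, utc(2010, 7, 1))
print(chosen.name)
```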
  56. Building a Linked Data Archive
      • Convert the archival data set(s) to HDT using HDT-CPP
      • Download the Triple Fragment Server code
      • Create the JSON config file for Memento
      • Run the server
  57. Title slide (repeated from slide 1)
