Herbert Van de Sompel & Miel Vander Sande
CNI Spring Meeting, San Antonio, TX, April 5 2016
Access to DBpedia Versions using Memento and Triple Pattern Fragments
Herbert Van de Sompel, @hvdsomp, Los Alamos National Laboratory
Miel Vander Sande, @Miel_vds, Ghent University
Acknowledgments: Lyudmila Balakireva, Harihar Shankar, Ruben Verborgh
Outline
• Prelude: Memento and Linked Data
• First Generation DBpedia Archive
• Devising Affordable/Useful Linked Data Archives
• Intermezzo: Triple Pattern Fragments (TPF)
• Intermezzo: Binary RDF Representation (HDT)
• Devising Affordable/Useful Linked Data Archives
• Second Generation DBpedia Archive
• Try This at Home
Memento Framework
Memento LDOW 2010 Submission
Herbert Van de Sompel et al. (2010) An HTTP-Based Versioning Mechanism for Linked Data
http://arxiv.org/abs/1003.3661
Memento and Linked Data
Time-Series Analysis across DBpedia Versions
Data collected through “follow your nose” HTTP Navigation
First Generation DBpedia Archive: Storage
Characteristics
• upload software: custom
• upload time: ~ 24 hours per version
• storage software: MongoDB
• storage space: 383 GB for 10 versions
• DBpedia versions: 10 versions, 2.0 through 3.9
• number of triples: ~ 3 billion
First Generation DBpedia Archive: Subject-URI Access
http://dbpedia.mementodepot.org/memento/2009052/http://dbpedia.org/page/Oaxaca
Characteristics
• TimeGate software: custom
• access type: Subject URI & datetime
• external integration: current DBpedia
• clients:
  • all clients: direct access to Memento Subject-URI
  • Memento clients: datetime negotiation with Subject-URI
DBpedia Archive @ LANL Since 2010
• Access based on Subject-URI (DBpedia Topic URI) only
• MongoDB storage
• A blob per Subject-URI per version
• Dynamically transformed to other RDF serializations
• No updates since version 3.9 (2013) of DBpedia as a result of
scalability problems
Affordable & Useful Linked Data Archives
• A Linked Data Archive consists of temporal snapshots of one or
more Linked Data sets, whereby each temporal snapshot reflects
the state of a Linked Data set at a specific moment or interval in
time.
• How to make Linked Data Archives accessible in a manner that is
• affordable/sustainable for the publisher
• useful for the consumer
Linked Data Archive: Characteristics
General Characteristics Publisher Consumer
Availability
Bandwidth
Cost
Functionality
Interface Expressiveness
LOD Integration
Memento Support
Cross Time/Data
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Linked Data Publishing
• The typical ways of publishing Linked Data on the Web:
• Subject URI access
• Data dump
• SPARQL endpoint
Let’s consider these from the perspective of Linked Data Archives,
i.e. archival storage and access
Linked Data Archive with Subject-URI Access
• For each temporal snapshot of a Linked Data set, and for each
Subject in that snapshot, publish an RDF description (of the Subject)
at a URI that is specific per snapshot/subject
Linked Data Archive with Subject-URI Access: Characteristics
General Characteristics     Publisher        Consumer
Availability                rather high      rather high
Bandwidth                   ~ description    ~ description
Cost                        rather low       rather high
Functionality
Interface Expressiveness    rather low
LOD Integration             yes
Memento Support             possible
Cross Time/Data             follow your nose
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Linked Data Archive Using Dumps
• Renders each temporal snapshot of a Linked Data set as a data
dump that places all temporal dataset triples (as they were at a
specific moment in time) into one or more files
Linked Data Archive Using Dumps: Characteristics
General Characteristics     Publisher    Consumer
Availability                high         high
Bandwidth                   high         high
Cost                        low          high
Functionality
Interface Expressiveness    download dataset
LOD Integration             no
Memento Support             not possible
Cross Time/Data             download various datasets
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Linked Data Archive with SPARQL Endpoint(s)
• For each temporal snapshot of a Linked Data set, supports arbitrary
SPARQL queries.
• Different architectural set-ups possible; no standard approach
Linked Data Archive Using SPARQL Endpoint(s): Characteristics
General Characteristics     Publisher      Consumer
Availability                problematic    problematic
Bandwidth                   ~ query        ~ query
Cost                        high           low
Functionality
Interface Expressiveness    highly expressive
LOD Integration             no
Memento Support             hard
Cross Time/Data             custom distributed queries
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Affordable & Useful Linked Data Archives
Linked Data Archive Type    Publishing    Consuming
Data Dump                   $$$$          ++++
SPARQL Endpoint(s)          $$$$          ++++
Subject URI Access          $$$$          ++++
Linked Data Fragments (Ghent U)
• Every Linked Data interface offers specific fragments of a Linked
Data set
• A fragment is described by
• Selector: what questions can I ask?
• Controls: how do I get more fragments?
• Metadata: helpful information for consumption?
• Each interface type comes with tradeoffs
• cf. the analysis thus far
http://linkeddatafragments.org
Verborgh, R. et al. (2014) Querying datasets on the Web with high availability. ISWC 2014
http://ruben.verborgh.org/publications/verborgh_iswc_2014/
Triple Pattern Fragments (Ghent U)
• Triple Pattern Fragments is a new interface with a different set of
tradeoffs that are attractive from an archival perspective
http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/
• Allows querying a Linked Data set according to
?Subject ?Predicate ?Object
patterns
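As an illustration of what a ?Subject ?Predicate ?Object pattern selects, here is a minimal in-memory sketch. It is not the TPF server's implementation; the function name `match_pattern`, the prefixed term strings, and the sample triples are all hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]


def match_pattern(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the bound positions.

    A position left as None behaves like a variable and matches anything.
    """
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


# Hypothetical sample data, written with prefixed names for brevity.
SAMPLE = [
    ("dbr:Person_A", "dbo:birthPlace", "dbr:Ghent"),
    ("dbr:Person_B", "dbo:birthPlace", "dbr:Oaxaca"),
    ("dbr:Person_A", "rdfs:label", '"Person A"'),
]

# Pattern: ?s dbo:birthPlace dbr:Ghent
born_in_ghent = list(match_pattern(SAMPLE, predicate="dbo:birthPlace", obj="dbr:Ghent"))
```

Leaving all three positions unbound returns the whole data set, which is why a real interface pages its responses.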
Controls: Responses provide navigational help for clients
• Based on emerging Hydra vocabulary for self-describing
Hypermedia-Driven Web APIs
Metadata: dataset info, estimated count (to aid client applications)
http://www.hydra-cg.com/spec/latest/core/
Binary RDF Representation for Publication and Exchange (HDT)
http://www.w3.org/Submission/HDT/
• Header-Dictionary-Triple (HDT) is a compact, binary representation
of RDF datasets.
• Able to represent massive data sets
• Dictionary/Triples structure achieves
• rapid search for ?subject ?predicate ?object pattern
• high compression rates
• Header provides metadata about the dataset
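The Dictionary/Triples idea can be sketched in a few lines: each distinct term is stored once and replaced by an integer ID, so the triples become small sorted integer tuples. This is a deliberately simplified illustration, not the actual HDT binary layout; the function names and sample terms are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
EncodedTriple = Tuple[int, int, int]


def encode(triples: List[Triple]) -> Tuple[Dict[str, int], List[EncodedTriple]]:
    """Map each distinct term to an integer ID and rewrite triples as ID tuples."""
    dictionary: Dict[str, int] = {}

    def term_id(term: str) -> int:
        # Assign the next free ID the first time a term is seen.
        if term not in dictionary:
            dictionary[term] = len(dictionary) + 1
        return dictionary[term]

    # Sorting groups triples by subject, then predicate, which helps lookups.
    encoded = sorted((term_id(s), term_id(p), term_id(o)) for s, p, o in triples)
    return dictionary, encoded


def decode(dictionary: Dict[str, int], encoded: List[EncodedTriple]) -> List[Triple]:
    """Turn ID tuples back into term triples."""
    reverse = {identifier: term for term, identifier in dictionary.items()}
    return [(reverse[s], reverse[p], reverse[o]) for s, p, o in encoded]


# Hypothetical sample data.
SAMPLE = [
    ("dbr:Person_A", "dbo:birthPlace", "dbr:Ghent"),
    ("dbr:Person_B", "dbo:birthPlace", "dbr:Ghent"),
    ("dbr:Person_A", "rdfs:label", '"Person A"'),
]

dictionary, encoded = encode(SAMPLE)
```

Long URIs that recur across many triples are stored only once in the dictionary, which is one intuition for the compression the slide mentions.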
HDT Linked Data Archive with TPF Support
• For each temporal snapshot of a Linked Data set, generate an HDT
serialization that provides access according to
?subject ?predicate ?object
patterns
Linked Data Archive with ?s?p?o Access: Characteristics
General Characteristics     Publisher    Consumer
Availability                high         high
Bandwidth                   ~ query      ~ query
Cost                        low          medium
Functionality
Interface Expressiveness    better than subject-URI only
LOD Integration             yes
Memento Support             possible
Cross Time/Data             follow your nose
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Affordable & Useful Linked Data Archives
Linked Data Archive Type    Publishing    Consuming
Data Dump                   $$$$          ++++
SPARQL Endpoint(s)          $$$$          ++++
Subject URI Access          $$$$          ++++
HDT & TPF                   $$$$          ++++
Second Generation DBpedia Archive: Storage
Characteristics
• upload software: HDT-CPP
• upload time: ~ 4 hours per version
• storage software: HDT binary files
• storage space: 70 GB for 12 versions
• DBpedia versions: 12 versions, 2.0 through 2015
• number of triples: ~ 5 billion
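To put the two storage slides side by side, the following sketch derives per-version and per-triple figures from the numbers reported for the first generation (MongoDB) and second generation (HDT) archives. It assumes "Gb" on the slides means gigabytes and treats the approximate triple counts as exact, so the results are rough indications only.

```python
# Figures copied from the two "Storage" slides; triple counts are approximate.
ARCHIVES = {
    "first generation (MongoDB)": {
        "storage_gb": 383, "versions": 10, "triples_billion": 3, "upload_hours": 24,
    },
    "second generation (HDT)": {
        "storage_gb": 70, "versions": 12, "triples_billion": 5, "upload_hours": 4,
    },
}


def summarize(archive: dict) -> dict:
    """Derive simple ratios from the reported totals."""
    return {
        "gb_per_version": archive["storage_gb"] / archive["versions"],
        "gb_per_billion_triples": archive["storage_gb"] / archive["triples_billion"],
        "total_upload_hours": archive["upload_hours"] * archive["versions"],
    }


summaries = {name: summarize(archive) for name, archive in ARCHIVES.items()}

for name, summary in summaries.items():
    print(
        f"{name}: {summary['gb_per_version']:.1f} GB per version, "
        f"{summary['gb_per_billion_triples']:.1f} GB per billion triples, "
        f"{summary['total_upload_hours']} upload hours in total"
    )
```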
Second Generation DBpedia Archive: ?s?p?o Query-URI Access
http://fragments.mementodepot.org/dbpedia_3_8?subject=&predicate=http://dbpedia.org/ontology/birthPlace&object=http://dbpedia.org/resource/Ghent
TimeGate URI
  template: http://fragments.mementodepot.org/timegate/dbpedia?subject={DBpediaURI}&predicate={DBpediaURI}&object={DBpediaURI}
  example: http://fragments.mementodepot.org/timegate/dbpedia?subject=&predicate=&object=http://dbpedia.org/resource/Ghent
TimeMap URI
  not supported
Memento URI
  template: http://fragments.mementodepot.org/{DBpediaVersion}?subject={DBpediaURI}&predicate={DBpediaURI}&object={DBpediaURI}
  example: http://fragments.mementodepot.org/dbpedia_3_0?subject=&predicate=&object=http://dbpedia.org/resource/Ghent
Further info: http://mementoweb.org/depot/native/fragments/
Try it with Memento for Chrome – http://bit.ly/memento-for-chrome
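The URI templates above can be filled in programmatically. The sketch below builds Memento and TimeGate query URIs from a triple pattern; the helper name `fragment_url` is hypothetical, and unlike the slide examples it percent-encodes the parameter values, on the assumption that the server decodes them as usual for query strings.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://fragments.mementodepot.org"


def fragment_url(path: str, subject: str = "", predicate: str = "", obj: str = "") -> str:
    """Build a ?s?p?o query URI; an empty string leaves that position unbound."""
    query = urlencode({"subject": subject, "predicate": predicate, "object": obj})
    return f"{BASE}/{path}?{query}"


# Memento URI: the pattern is evaluated against one specific DBpedia version.
memento = fragment_url(
    "dbpedia_3_8",
    predicate="http://dbpedia.org/ontology/birthPlace",
    obj="http://dbpedia.org/resource/Ghent",
)

# TimeGate URI: the version is chosen by datetime negotiation.
timegate = fragment_url("timegate/dbpedia", obj="http://dbpedia.org/resource/Ghent")

# Decoding the query string recovers the original pattern.
decoded = parse_qs(urlsplit(memento).query, keep_blank_values=True)
```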
Second Generation DBpedia Archive: Subject-URI Access
TimeGate URI
  template: http://dbpedia.mementodepot.org/timegate/{DBpediaURI}
  example: http://dbpedia.mementodepot.org/timegate/http://dbpedia.org/data/Ghent
TimeMap URI
  template: http://dbpedia.mementodepot.org/timemap/link/{DBpediaURI}
  example: http://dbpedia.mementodepot.org/timemap/link/http://dbpedia.org/data/Ghent
Memento URI
  template: http://dbpedia.mementodepot.org/{yyyymmdd}/{DBpediaURI}
  example: http://dbpedia.mementodepot.org/20080103/http://dbpedia.org/data/Ghent
Further info: http://mementoweb.org/depot/native/dbpedia/
Try it with Memento for Chrome – http://bit.ly/memento-for-chrome
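For clients without a Memento browser extension, datetime negotiation comes down to sending the Memento `Accept-Datetime` request header to the TimeGate. The sketch below only prepares such a request and the matching direct Memento URI; it sends nothing over the network. The helper names are hypothetical, and whether the archive still responds at these addresses is not something this example checks.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

ARCHIVE = "http://dbpedia.mementodepot.org"


def timegate_request(original_uri: str, when: datetime) -> Request:
    """Prepare a TimeGate request asking for the version closest to `when`.

    `when` must be timezone-aware UTC so it can be rendered as an HTTP date.
    """
    return Request(
        f"{ARCHIVE}/timegate/{original_uri}",
        headers={"Accept-Datetime": format_datetime(when, usegmt=True)},
    )


def memento_uri(original_uri: str, when: datetime) -> str:
    """Fill in the {yyyymmdd}/{DBpediaURI} Memento URI template."""
    return f"{ARCHIVE}/{when:%Y%m%d}/{original_uri}"


when = datetime(2008, 1, 3, tzinfo=timezone.utc)
request = timegate_request("http://dbpedia.org/data/Ghent", when)
direct = memento_uri("http://dbpedia.org/data/Ghent", when)
```

Passing `request` to `urllib.request.urlopen` would perform the negotiation; a TimeGate typically answers with a redirect to the selected Memento.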
Second Generation DBpedia Archive: Access
Characteristics
TimeGate software
  ① node.js LDF server 2.0.0
  ② LDF js client
access type
  ① ?s?p?o Query-URI & datetime
  ② Subject-URI & datetime
external integration
  ① DBpedia LDF server
  ② current DBpedia
clients
• all clients: direct access to Mementos of Subject-URI and ?s?p?o Query-URI
• Memento clients: datetime negotiation with Subject-URI and ?s?p?o Query-URI
Building a Linked Data Archive
• Convert the archival data set(s) to HDT using HDT-CPP
HDT Software (C++)
https://github.com/rdfhdt/hdt-cpp
• input data requires cleaning before processing, especially regarding URI characters
• DBpedia data not clean
• DBpedia v3.5 was not successfully processed
• No meaningful error messages to help locate problems
• memory intensive
• Kyoto Cabinet was used to optimize storage requirement and speed during processing
• Java version exists but has memory problems
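Since the converter gives little help in locating bad input, a pre-check along the following lines may be useful before conversion. This is a rough heuristic and not the cleaning procedure used for the archive: it flags IRIs containing characters that the N-Triples grammar does not allow inside `<...>`, and it can misfire on literals that happen to contain angle brackets. The function name and sample lines are hypothetical.

```python
import re
from typing import List

# Anything between angle brackets is treated as a candidate IRI.
IRI_CANDIDATE = re.compile(r"<([^>]*)>")
# Characters not permitted unescaped in an N-Triples IRIREF.
FORBIDDEN = re.compile(r'[\x00-\x20<"{}|^`\\]')
# \uXXXX and \UXXXXXXXX escapes are legal, so ignore them before checking.
UNICODE_ESCAPE = re.compile(r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}")


def suspicious_iris(line: str) -> List[str]:
    """Return the IRIs on this line that contain forbidden characters."""
    flagged = []
    for iri in IRI_CANDIDATE.findall(line):
        if FORBIDDEN.search(UNICODE_ESCAPE.sub("", iri)):
            flagged.append(iri)
    return flagged


# Hypothetical input lines: the first has a space inside the subject IRI.
BAD_LINE = (
    "<http://dbpedia.org/resource/Foo Bar> "
    "<http://dbpedia.org/ontology/birthPlace> "
    "<http://dbpedia.org/resource/Ghent> ."
)
GOOD_LINE = (
    "<http://dbpedia.org/resource/Foo_Bar> "
    "<http://dbpedia.org/ontology/birthPlace> "
    "<http://dbpedia.org/resource/Ghent> ."
)

flagged = suspicious_iris(BAD_LINE)
```

Running the check over a dump line by line, and printing the line number with each flagged IRI, gives the location information the slide says is missing from the converter's error messages.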
• Download the Linked Data Fragment Server code
Linked Data Fragment Server (Node.js)
https://github.com/LinkedDataFragments/Server.js
• provides ?s?p?o access to local and/or remote Linked Data sets
• supports HDT, Turtle files, N-Triples files, JSON-LD files, SPARQL endpoints, in-memory store, and BlazeGraph Linked Data sets
• version 2.0.0 (released March 31 2016) has built-in Memento support
• Create the JSON config file for Memento
Linked Data Fragment Server, Memento Configuration
https://github.com/LinkedDataFragments/Server.js/wiki/Configuring-Memento
• declare archival data set(s)
• add datetime ranges for the archival data set(s)
• add a TimeGate
• list the archival data set(s) for which the TimeGate should support datetime negotiation
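The datetime ranges are what let a TimeGate map a requested datetime to one archival data set. The sketch below shows that selection logic in isolation. It does not reproduce the server's configuration format or code: the `Version` type, the function name, and the date ranges are hypothetical placeholders, and the fallback for datetimes outside every range is an assumption.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple


class Version(NamedTuple):
    name: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def select_version(versions: List[Version], requested: datetime) -> str:
    """Pick the data set whose datetime range covers the requested datetime."""
    ordered = sorted(versions, key=lambda version: version.start)
    for version in ordered:
        if version.start <= requested < version.end:
            return version.name
    # Outside every range: use the latest version that started earlier,
    # or the first version when the request predates the whole archive.
    earlier = [version for version in ordered if version.start <= requested]
    return (earlier[-1] if earlier else ordered[0]).name


UTC = timezone.utc
# Placeholder ranges for illustration only; not the archive's real dates.
VERSIONS = [
    Version("dbpedia_3_8", datetime(2012, 8, 1, tzinfo=UTC), datetime(2013, 9, 1, tzinfo=UTC)),
    Version("dbpedia_3_9", datetime(2013, 9, 1, tzinfo=UTC), datetime(2014, 9, 1, tzinfo=UTC)),
]

chosen = select_version(VERSIONS, datetime(2013, 1, 1, tzinfo=UTC))
```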
• Run the server

DBpedia Archive using Memento, Triple Pattern Fragments, and HDT

  • 1.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Herbert Van de Sompel @hvdsomp Los Alamos National Laboratory Acknowledgments: Lyudmila Balakireva, Harihar Shankar, Ruben Verborgh Access to DBpedia Versions using Memento and Triple Pattern Fragments Miel Vander Sande @Miel_vds Ghent University
  • 2.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Outline • Prelude: Memento and Linked Data • First Generation DBpedia Archive • Devising Affordable/Useful Linked Data Archives • Intermezzo: Triple Pattern Fragments (TPF) • Intermezzo: Binary RDF Representation (HDT) • Devising Affordable/Useful Linked Data Archives • Second Generation DBpedia Archive • Try this At Home
  • 3.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Outline • Prelude: Memento and Linked Data • First Generation DBpedia Archive • Devising Affordable/Useful Linked Data Archives • Intermezzo: Triple Pattern Fragments (TPF) • Intermezzo: Binary RDF Representation (HDT) • Devising Affordable/Useful Linked Data Archives • Second Generation DBpedia Archive • Try this At Home
  • 4.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Memento Framework
  • 5.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Memento LDOW 2010 Submission Herbert Van de Sompel et al. (2010) An HTTP-Based Versioning Mechanism for Linked Data http://arxiv.org/abs/1003.3661
  • 6.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Memento and Linked Data
  • 7.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Memento and Linked Data
  • 8.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Time-Series Analysis across DBpedia Versions Data collected through “follow your nose” HTTP Navigation
  • 9.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Outline • Prelude: Memento and Linked Data • First Generation DBpedia Archive • Devising Affordable/Useful Linked Data Archives • Intermezzo: Triple Pattern Fragments (TPF) • Intermezzo: Binary RDF Representation (HDT) • Devising Affordable/Useful Linked Data Archives • Second Generation DBpedia Archive • Try this At Home
  • 10.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 First Generation DBpedia Archive: Storage
  • 11.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 First Generation DBpedia Archive: Storage Characteristics upload software custom upload time ~ 24 hours per version storage software MongoDB storage space 383 Gb for 10 versions DBpedia versions 10 versions: 2.0 through 3.9 number of triples ~ 3 billion
  • 12.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 First Generation DBpedia Archive: Subject-URI Access
  • 13.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 First Generation DBpedia Archive: Subject-URI Access http://dbpedia.mementodepot.org/memento/2009052/http://dbpedia.org/page/Oaxaca
  • 14.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 First Generation DBpedia Archive: Subject-URI Access Characteristics TimeGate software custom access type Subject URI & datetime external integration current DBpedia clients • all clients: direct access to Memento Subject-URI • Memento clients: datetime negotiation with Subject-URI
  • 15.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 DBpedia Archive @ LANL Since 2010 • Access based on Subject-URI (DBpedia Topic URI) only • MongoDB storage • A blob per Subject-URI per version • Dynamically transformed to other RDF serializations • No updates since version 3.9 (2013) of DBpedia as a result of scalability problems !!! !!!
  • 16.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Outline • Prelude: Memento and Linked Data • First Generation DBpedia Archive • Devising Affordable/Useful Linked Data Archives • Intermezzo: Triple Pattern Fragments (TPF) • Intermezzo: Binary RDF Representation (HDT) • Devising Affordable/Useful Linked Data Archives • Second Generation DBpedia Archive • Try this At Home
  • 17.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Affordable & Useful Linked Data Archives • A Linked Data Archive consists of temporal snapshots of one or more Linked Data sets, whereby each temporal snapshot reflects the state of a Linked Data set at a specific moment or interval in time. • How to make Linked Data Archives accessible in a manner that is • affordable/sustainable for the publisher • useful for the consumer
  • 18.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Linked Data Archive: Characteristics General Characteristics Publisher Consumer Availability Bandwidth Cost Functionality Interface Expressiveness LOD Integration Memento Support Cross Time/Data Verdict: • Publication perspective: $$$$ • Access perspective: ++++
  • 19.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Linked Data Publishing • The typical ways of publishing Linked Data on the Web: • Subject URI access • Data dump • SPARQL endpoint Let’s consider these from the perspective of Linked Data Archives, i.e. archival storage and access
  • 20.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Linked Data Archive with Subject-URI Access • For each temporal snapshot of a Linked Data set, and for each Subject in that snapshot, publish an RDF description (of the Subject) at a URI that is specific per snapshot/subject
  • 21.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Linked Data Archive with Subject-URI Access: Characteristics General Characteristics Publisher Consumer Availability rather high rather high Bandwidth ~ description ~ description Cost rather low rather high Functionality Interface Expressiveness rather low LOD Integration yes Memento Support possible Cross Time/Data follow your nose Verdict: • Publication perspective: $$$$ • Access perspective: ++++
  • 22.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Linked Data Archive Using Dumps • Renders each temporal snapshot of a Linked Data set as a data dump that places all temporal dataset triples (as they were at a specific moment in time) into one or more files
  • 23.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Linked Data Archive Using Dumps: Characteristics General Characteristics Publisher Consumer Availability high high Bandwidth high high Cost low high Functionality Interface Expressiveness download dataset LOD Integration no Memento Support not possible Cross Time/Data download various datasets Verdict: • Publication perspective: $$$$ • Access perspective: ++++
  • 24.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Linked Data Archive with SPARQL Endpoint(s) • For each temporal snapshot of a Linked Data set, supports arbitrary SPARQL queries. • Different architectural set-ups possible; no standard approach
  • 25.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Linked Data Archive Using SPARQL Endpoint(s): Characteristics General Characteristics Publisher Consumer Availability problematic problematic Bandwidth ~ query ~ query Cost high low Functionality Interface Expressiveness highly expressive LOD Integration no Memento Support hard Cross Time/Data custom distributed queries Verdict: • Publication perspective: $$$$ • Access perspective: ++++
  • 26.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Affordable & Useful Linked Data Archives Linked Data Archive Type Publishing Consuming Data Dump $$$$ ++++ SPARQL Endpoint(s) $$$$ ++++ Subject URI Access $$$$ ++++
  • 27.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Outline • Prelude: Memento and Linked Data • First Generation DBpedia Archive • Devising Affordable/Useful Linked Data Archives • Intermezzo: Triple Pattern Fragments (TPF) • Intermezzo: Binary RDF Representation (HDT) • Devising Affordable/Useful Linked Data Archives • Second Generation DBpedia Archive • Try this At Home
  • 28.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Linked Data Fragments (Ghent U) • Every Linked Data interface offers specific fragments of a Linked Data set • A fragment is described by • Selector: what questions can I ask? • Controls: how do I get more fragments? • Metadata: helpful information for consumption? • Each interface type comes with tradeoffs • cf. the analysis thus far http://linkeddatafragments.org Verborgh, R. et al. (2014) Querying datsets on the web with high availability. ISWC 2014 http://ruben.verborgh.org/publications/verborgh_iswc_2014/
  • 29.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Triple Pattern Fragments (Ghent U) • Triple Pattern Fragments is a new interface with a different set of tradeoffs that are attractive from an archival perspective http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/
  • 30.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Triple Pattern Fragments (Ghent U) • Allows querying a Linked Data set according to ?Subject ?Predicate ?Object patterns
  • 31.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Triple Pattern Fragments (Ghent U) Controls: Responses provide navigational help for clients • Based on emerging Hydra vocabulary for self-describing Hypermedia-Driven Web APIs Metadata: dataset info, estimated count (to aid client applications) http://www.hydra-cg.com/spec/latest/core/
  • 32.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Outline • Prelude: Memento and Linked Data • First Generation DBpedia Archive • Devising Affordable/Useful Linked Data Archives • Intermezzo: Triple Pattern Fragments (TPF) • Intermezzo: Binary RDF Representation (HDT) • Devising Affordable/Useful Linked Data Archives • Second Generation DBpedia Archive • Try this At Home
  • 33.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Binary RDF Representation for Publication and Exchange (HDT) http://www.w3.org/Submission/HDT/
  • 34.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Binary RDF Representation for Publication and Exchange (HDT) http://www.w3.org/Submission/HDT/ • Header-Dictionary-Triple (HDT) is a compact, binary representation of RDF datasets.
  • 35.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Binary RDF Representation for Publication and Exchange (HDT) http://www.w3.org/Submission/HDT/ • Able to represent massive data sets • Dictionary/Triples structure achieves • rapid search for ?subject ?predicate ?object pattern • high compression rates • Header provides metadata about the dataset
  • 36.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Outline • Prelude: Memento and Linked Data • First Generation DBpedia Archive • Devising Affordable/Useful Linked Data Archives • Intermezzo: Triple Pattern Fragments (TPF) • Intermezzo: Binary RDF Representation (HDT) • Devising Affordable/Useful Linked Data Archives • Second Generation DBpedia Archive • Try this At Home
  • 37.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 HDT Linked Data Archive with TPF Support • For each temporal snapshot of a Linked Data set, generate an HDT serialization that provides access according to ?subject ?predicate ?object patterns
  • 38.
    Herbert Van deSompel & Miel Vander Sande CNI Spring Meeting, San Antonio, TX, April 5 2016 Linked Data Archive with ?s?p?o Access: Characteristics General Characteristics Publisher Consumer Availability high high Bandwidth ~ query ~ query Cost low medium Functionality Interface Expressiveness better than subject-URI only LOD Integration yes Memento Support possible Cross Time/Data follow your nose Verdict: • Publication perspective: $$$$ • Access perspective: ++++
  • 39.
    Affordable & Useful Linked Data Archives
    Linked Data Archive Type (Publishing / Consuming)
    • Data Dump: $$$$ / ++++
    • SPARQL Endpoint(s): $$$$ / ++++
    • Subject URI Access: $$$$ / ++++
    • HDT & TPF: $$$$ / ++++
  • 40.
    Outline
    • Prelude: Memento and Linked Data
    • First Generation DBpedia Archive
    • Devising Affordable/Useful Linked Data Archives
    • Intermezzo: Triple Pattern Fragments (TPF)
    • Intermezzo: Binary RDF Representation (HDT)
    • Devising Affordable/Useful Linked Data Archives
    • Second Generation DBpedia Archive
    • Try this At Home
  • 41.
    Second Generation DBpedia Archive: Storage
  • 42.
    Second Generation DBpedia Archive: Storage Characteristics
    • upload software: HDT-CPP
    • upload time: ~ 4 hours per version
    • storage software: HDT binary files
    • storage space: 70 GB for 12 versions
    • DBpedia versions: 12 versions, 2.0 through 2015
    • number of triples: ~ 5 billion
  • 43.
    Second Generation DBpedia Archive: ?s?p?o Query-URI Access
  • 44.
    Second Generation DBpedia Archive: ?s?p?o Query-URI Access
    http://fragments.mementodepot.org/dbpedia_3_8?subject=&predicate=http://dbpedia.org/ontology/birthPlace&object=http://dbpedia.org/resource/Ghent
  • 45.
    Second Generation DBpedia Archive: ?s?p?o Query-URI Access
    • TimeGate URI
      • Template: http://fragments.mementodepot.org/timegate/dbpedia?subject={DBpediaURI}&predicate={DBpediaURI}&object={DBpediaURI}
      • Example: http://fragments.mementodepot.org/timegate/dbpedia?subject=&predicate=&object=http://dbpedia.org/resource/Ghent
    • TimeMap URI: not supported
    • Memento URI
      • Template: http://fragments.mementodepot.org/{DBpediaVersion}?subject={DBpediaURI}&predicate={DBpediaURI}&object={DBpediaURI}
      • Example: http://fragments.mementodepot.org/dbpedia_3_0?subject=&predicate=&object=http://dbpedia.org/resource/Ghent
    • Further info: http://mementoweb.org/depot/native/fragments/
    • Try it with Memento for Chrome – http://bit.ly/memento-for-chrome
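The Query-URI patterns on this slide can be assembled programmatically. The following is a minimal Python sketch, assuming only the `subject`, `predicate` and `object` query parameters and the host shown on the slide. The helper names `fragment_uri` and `accept_datetime_header` are hypothetical. `Accept-Datetime` is the request header that the Memento protocol (RFC 7089) uses for datetime negotiation with a TimeGate.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import urlencode

BASE = "http://fragments.mementodepot.org"  # host taken from the slide


def fragment_uri(path, subject="", predicate="", obj=""):
    """Build a ?s?p?o query URI; an empty string leaves that position unbound."""
    query = urlencode(
        {"subject": subject, "predicate": predicate, "object": obj},
        safe=":/",  # keep ':' and '/' unescaped so the URIs read as on the slide
    )
    return f"{BASE}/{path}?{query}"


def accept_datetime_header(moment):
    """Format a timezone-aware UTC datetime as an Accept-Datetime header."""
    return {"Accept-Datetime": format_datetime(moment, usegmt=True)}


# Direct access to one Memento (a fixed DBpedia version).
fragment_memento = fragment_uri(
    "dbpedia_3_8",
    predicate="http://dbpedia.org/ontology/birthPlace",
    obj="http://dbpedia.org/resource/Ghent",
)

# Datetime negotiation: send this header with a request to the TimeGate URI.
fragment_timegate = fragment_uri(
    "timegate/dbpedia", obj="http://dbpedia.org/resource/Ghent"
)
negotiation_headers = accept_datetime_header(
    datetime(2008, 1, 3, tzinfo=timezone.utc)
)
```

The sketch only builds the URI and header; issuing the request and following the TimeGate's redirect to the selected Memento is left to whichever HTTP client is in use.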
  • 46.
    Second Generation DBpedia Archive: Subject-URI Access
  • 47.
    Second Generation DBpedia Archive: Subject-URI Access
    • TimeGate URI
      • Template: http://dbpedia.mementodepot.org/timegate/{DBpediaURI}
      • Example: http://dbpedia.mementodepot.org/timegate/http://dbpedia.org/data/Ghent
    • TimeMap URI
      • Template: http://dbpedia.mementodepot.org/timemap/link/{DBpediaURI}
      • Example: http://dbpedia.mementodepot.org/timemap/link/http://dbpedia.org/data/Ghent
    • Memento URI
      • Template: http://dbpedia.mementodepot.org/{yyyymmdd}/{DBpediaURI}
      • Example: http://dbpedia.mementodepot.org/20080103/http://dbpedia.org/data/Ghent
    • Further info: http://mementoweb.org/depot/native/dbpedia/
    • Try it with Memento for Chrome – http://bit.ly/memento-for-chrome
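The three Subject-URI templates on this slide differ only in their path prefix, which a short Python sketch can make explicit. The function names are hypothetical; the host and path segments are copied from the slide.

```python
from datetime import date

DEPOT = "http://dbpedia.mementodepot.org"  # host taken from the slide


def subject_timegate_uri(dbpedia_uri):
    """TimeGate: entry point for datetime negotiation on a DBpedia URI."""
    return f"{DEPOT}/timegate/{dbpedia_uri}"


def subject_timemap_uri(dbpedia_uri):
    """TimeMap: list of all archived versions, in link format."""
    return f"{DEPOT}/timemap/link/{dbpedia_uri}"


def subject_memento_uri(dbpedia_uri, day):
    """Memento: one archived version, addressed by a yyyymmdd datestamp."""
    return f"{DEPOT}/{day.strftime('%Y%m%d')}/{dbpedia_uri}"


ghent = "http://dbpedia.org/data/Ghent"
ghent_timegate = subject_timegate_uri(ghent)
ghent_timemap = subject_timemap_uri(ghent)
ghent_memento = subject_memento_uri(ghent, date(2008, 1, 3))
```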
  • 48.
    Second Generation DBpedia Archive: Access Characteristics
    • TimeGate software: ① node.js LDF server 2.0.0 ② LDF js client
    • access type: ① ?s?p?o Query-URI & datetime ② Subject-URI & datetime
    • external integration: ① DBpedia LDF server ② current DBpedia
    • clients
      • all clients: direct access to Mementos of Subject-URI and ?s?p?o Query-URI
      • Memento clients: datetime negotiation with Subject-URI and ?s?p?o Query-URI
  • 49.
    Outline
    • Prelude: Memento and Linked Data
    • First Generation DBpedia Archive
    • Devising Affordable/Useful Linked Data Archives
    • Intermezzo: Triple Pattern Fragments (TPF)
    • Intermezzo: Binary RDF Representation (HDT)
    • Devising Affordable/Useful Linked Data Archives
    • Second Generation DBpedia Archive
    • Try this At Home
  • 50.
    Building a Linked Data Archive
    • Convert the archival data set(s) to HDT using HDT-CPP
  • 51.
    HDT Software (C++) https://github.com/rdfhdt/hdt-cpp
    • input data requires cleaning before processing, especially regarding URI characters
    • DBpedia data not clean
    • DBpedia v3.5 was not successfully processed
    • No meaningful error messages to help locate problems
    • memory intensive
    • Kyoto Cabinet was used to optimize storage requirement and speed during processing
    • Java version exists but has memory problems
  • 52.
    Building a Linked Data Archive
    • Convert the archival data set(s) to HDT using HDT-CPP
    • Download the Triple Fragment Server code
  • 53.
    Linked Data Fragment Server (Node.js) https://github.com/LinkedDataFragments/Server.js
    • provides ?s?p?o access to local and/or remote Linked Data sets
    • supports HDT, Turtle files, N-Triples files, JSON-LD files, SPARQL endpoints, in-memory store, and BlazeGraph Linked Data sets
    • version 2.0.0 (released March 31 2016) has built-in Memento support
  • 54.
    Building a Linked Data Archive
    • Convert the archival data set(s) to HDT using HDT-CPP
    • Download the Triple Fragment Server code
    • Create the JSON config file for Memento
  • 55.
    Linked Data Fragment Server, Memento Configuration https://github.com/LinkedDataFragments/Server.js/wiki/Configuring-Memento
    • declare archival data set(s)
    • add datetime ranges for the archival data set(s)
    • add a TimeGate
    • list the archival data set(s) for which the TimeGate should support datetime negotiation
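The datetime ranges in the configuration are what allow a TimeGate to pick a data set for a requested datetime. The Python sketch below illustrates that selection step in isolation. It is an assumption about the general idea, not the LDF server's actual logic or configuration syntax. The snapshot names and dates are placeholders, not real DBpedia release dates, and `select_snapshot` is a hypothetical helper.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical snapshots: (start of validity, data set name), sorted by start.
# The dates are placeholders for illustration only.
SNAPSHOTS = [
    (datetime(2010, 1, 1, tzinfo=timezone.utc), "snapshot_a"),
    (datetime(2012, 1, 1, tzinfo=timezone.utc), "snapshot_b"),
    (datetime(2014, 1, 1, tzinfo=timezone.utc), "snapshot_c"),
]


def select_snapshot(requested, snapshots=SNAPSHOTS):
    """Return the data set whose datetime range contains `requested`.

    Each snapshot is treated as valid from its start until the next one
    begins; the last range is open-ended. Requests earlier than the first
    start are clamped to the first snapshot.
    """
    starts = [start for start, _ in snapshots]
    index = bisect_right(starts, requested) - 1
    return snapshots[max(index, 0)][1]


chosen = select_snapshot(datetime(2011, 6, 15, tzinfo=timezone.utc))
```

A TimeGate built on this idea would redirect the client to the Memento URI of the chosen data set; the configuration page linked on the slide documents the settings the server actually expects.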
  • 56.
    Building a Linked Data Archive
    • Convert the archival data set(s) to HDT using HDT-CPP
    • Download the Triple Fragment Server code
    • Create the JSON config file for Memento
    • Run the server
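The steps above can be sketched as a command plan. This Python snippet only builds and prints the commands rather than running them. It assumes the `rdf2hdt` tool from HDT-CPP and the `ldf-server` command from the npm package of the same name are the chosen way to carry out the steps; the file names, port and worker count are hypothetical, and the JSON config file still has to be written by hand.

```python
import shlex


def build_commands(rdf_file, hdt_file, config_file, port=5000, workers=4):
    """Return the shell commands for the build steps, in order."""
    return [
        # Step 1: convert one archival data set to HDT.
        ["rdf2hdt", rdf_file, hdt_file],
        # Step 2: obtain the server (installing from npm is one option).
        ["npm", "install", "-g", "ldf-server"],
        # Step 3 (writing the JSON config for Memento) is manual.
        # Step 4: run the server with that config.
        ["ldf-server", config_file, str(port), str(workers)],
    ]


commands = build_commands("dbpedia_3_8.nt", "dbpedia_3_8.hdt", "config.json")
for command in commands:
    print(shlex.join(command))
```

The conversion step would be repeated once per DBpedia version; given the input-cleaning and memory issues noted for HDT-CPP, each run is best checked before moving on.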
  • 57.
    Access to DBpedia Versions using Memento and Triple Pattern Fragments
    Herbert Van de Sompel, @hvdsomp, Los Alamos National Laboratory
    Miel Vander Sande, @Miel_vds, Ghent University
    Acknowledgments: Lyudmila Balakireva, Harihar Shankar, Ruben Verborgh