LiveLinkedData - TransWebData - Nantes 2013

Notes

  • Moreover, the Linked Data cloud has been growing steadily; for example, here we see the 2007 picture of the Linked Open Data cloud.
  • To this big picture in 2011
  • Linked Data's main rule is that your data needs to reference, or "link" to, existing knowledge in other sources; for example, LinkedMDB relates its entity for the movie "The Shining" to DBpedia, YAGO, and GeoNames.
  • Let’s see this applied to our previous example: now the connections are streams that I subscribe to, and from which I pull the information I’m interested in (like this…). Notice that with this I can solve the issue of freshness: whenever someone updates their data, I can immediately react (example given…). I also ease the performance issue, as processing is done locally. As in the other approaches, I can solve the data-curation issue by incorporating a source and editing it locally (example given…). This curation also generates an operation stream that I publish and advertise to others; people will start to follow my changes, and so I would also solve the write-enabling issue.
  • So, our expectation is that other data publishers will also start to open up their updates, favoring the creation of mashups by means of the "follow your change" relation and creating another network on top of the existing Linked Data cloud. Of course, these new mashups are not static: they can perform value-adding operations, and if this value is high enough, the original publisher can also start following these changes and, voilà, we have the writing that we want.
  • Before jumping into that, we need to review a little what RDF is and what its operations are.
  • SPARQL operations have a very well-defined formal semantics: one set of operations works on graphs (all shown here but the last two) and another works on datasets.
  • (Read, and then…) For example, pattern operations are state-dependent, so I cannot preserve their intention; the same goes for blank nodes, which are like existential variables, and the SPARQL specification leaves it to each store how to name them.
  • To implement the pattern operations, we chose to compute the affected triples against the local state and send those to the other peers.
  • Read…
  • The main interest of Linked Data is to perform queries over multiple datasets. To do that, and assuming we have already discovered the relevant datasets, the first way to proceed is to make a local copy of them and run the query on it; for example, I can ask what the largest city of Vicente Fernández's country is by asking MusicBrainz for the country and Wikipedia for the largest city. But if the original datasets get modified, I lose freshness.
  • On the other hand, instead of bringing the data to me, I can go to it, running the query in a distributed fashion. The drawback is that I depend on the reliability and performance of the endpoints: if one is down, my query fails.
  • So, how does this look on real Live Linked Data? We already mentioned that DBpedia has started publishing its updates…
  • So an issue that we can raise here is: what happens when I find a mistake in a dataset that is not mine, where I don't know how, or don't have the rights, to modify it? Put more generally: how do I make this Linked Data network "writable"?
  • So, we know from CRDT theory that if all operations commute, the object is conflict-free… but…

Transcript

  • 1. Live Linked Data: Making Linked Data writable with eventual consistency guarantees. Luis-Daniel Ibáñez, Nantes University, GDD team.
    Luis-Daniel Ibáñez, Hala Skaf-Molli, Pascal Molli, and Olivier Corby. Synchronizing semantic stores with commutative replicated data types. In Semantic Web Collaborative Spaces (SWCS) @ WWW’12.
    Luis-Daniel Ibáñez, Hala Skaf-Molli, Pascal Molli, and Olivier Corby. Live Linked Data: Synchronizing semantic stores with Commutative Replicated Data Types. Special Issue on Resource Discovery, IJMSO (to appear).
  • 2. Linked Data. Autonomous participants agree on a set of principles for publishing RDF data, allowing its querying and browsing across distributed servers. In 2011, more than 30 billion RDF triples were distributed among ~700 autonomous participants¹.
    ¹ Billion Triple Challenge + Linked Open Data metadata; T. Käfer et al., Towards a Dynamic Linked Data Observatory, LDOW@WWW12.
  • 3. LOD - 2007 October
  • 4. LOD - 2011 September
  • 5. Here is a list of links created by LinkedMDB that relate the movie "The Shining" to other open datasets:
    owl:sameAs to "The_Shining_(film)" on DBpedia
    owl:sameAs to "The_Shining_(film)" on YAGO
    dbpedia:hasPhotoCollection to the movie's photos on the flickr™ wrappr
    movie:relatedBook to the movie's related book on RDF Book Mashup
    owl:sameAs from one of the music composers of the movie, Béla Bartók, to MusicBrainz: http://zitgist.com/music/artist/fd14da1b-3c2d-4cc8-9ca6fc8c62ce6988
    rdfs:seeAlso from one of the writers of the movie, Stanley Kubrick, to RDF Book Mashup: http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Stanley+Kubrick
    foaf:based_near to the locations of the movie in the US and GB: http://sws.geonames.org/2635167/ and http://sws.geonames.org/6252001/
    movie:language to the language of the movie on lingvoj.org: http://www.lingvoj.org/lingvo/en
    Source: http://wiki.linkedmdb.org/Main/Interlinking
  • 6. Querying Linked Data: Freshness vs. Performance. [Diagram: "My Super Cool Query Answerer about Venezuela" built over DBpedia English, DBpedia Spanish, Freebase, and Wolfram.] Local copying → not fresh. Distributed querying → performance and reliability problems.
  • 7. Querying Linked Data: Data Quality. What is the area and population of Girardot Municipality? DBpedia English: 302 km², 439,947 inhabitants. DBpedia Spanish: 311 km², 407,577. Freebase: unknown, 439,947. Wolfram: 302 km², 396,095. Open Data Venezuela: 314 km², 407,109. All different =S … and all wrong!!!
  • 8. Updating Linked Data. "Update Girardot municipality's data to the official values: 314 km², 407,109 inhabitants." [Diagram: the request is sent to DBpedia English, DBpedia Spanish, Freebase, and Wolfram; the answer is "你是什么意思?" ("What do you mean?").] Editing others' datasets is generally not possible. How to make Linked Data writable? How to move from Linked Data 1.0 to Linked Data 2.0?
  • 9. Live Linked Data Approach (1). SPARQL Update 1.1 allows updating RDF data locally. Update feeds provide streams of updates (e.g., DBpedia Live) → allow data freshness. Processing is local. The writing is "indirect": only if others decide to consume my stream can I update their datasets.
  • 10. Live Linked Data: "Follow your change". The curating update, published as a SPARQL Update stream:

      DELETE { Dbpediaen/Girardot area ?a1 .
               Dbpediaen/Girardot popul ?p1 .
               ?ent area ?a2 .
               ?ent popul ?p2 . }
      WHERE  { ?ent sameAs Dbpediaen/Girardot } ;
      INSERT DATA { ODVenezuela/Girardot area 314 .
                    ODVenezuela/Girardot popul 407109 .
                    ODVenezuela/Girardot sameAs Dbpediaen/Girardot .
                    ODVenezuela/Girardot sameAs Freebase/Girardot . }

    [Diagram: the sources and their data]
      Open Data Venezuela: ODVenezuela/Girardot area 314 ; popul 407109. "Would you follow my changes? They are official… ;)"
      DBpedia English: Dbpediaen/Girardot area 302 ; popul 439947 ; sameAs Dbpediaes/Girardot ; sameAs Freebase/Girardot ; …
      DBpedia Spanish: Dbpediaes/Girardot area 311 ; popul 407577. "Great!"
      Freebase: Freebase/Girardot area 308 ; area "null" ; popul 439947 ; ODVenezuela/Girardot sameAs Dbpediaen/Girardot ; ODVenezuela/Girardot sameAs Freebase/Girardot ; …
      Wolfram: Wolfram/Girardot area 302 ; popul 396095.
    SPARQL Update stream consumption → freshness. Local processing → performance. SPARQL Update stream publishing → write enabling.
  • 11. Live Linked Data complements the LOD Cloud. [Diagram: "follow your change" relations layered on top of the LOD cloud.]
  • 12. Live Linked Data Overview. A social network of Linked Data participants based on a « follow your change » relation. Makes Linked Data a consistent « read/write » space: from Linked Data 1.0 to Linked Data 2.0. Creates assemblies of datasets and enables a « synchronize and search » paradigm: between the warehousing approach and the distributed-queries approach.
  • 13. The Forgotten Issue: Consistency. Synchronizing semantic sources is not new: RDFSync, SPARQL Push (PubSubHubbub), ResourceSync… But none of them addresses consistency when concurrent operations or out-of-order network delivery occur. [Diagram: Source1 and Source2 exchange Insert DATA {a b c} and Delete DATA {a b c}; depending on the order of reception, one source ends with {a b c} and the other without it. The output depends on the order of reception.]
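
    To make the order-dependence concrete, here is a tiny sketch (plain Python sets; not from the slides) of two replicas that apply the same insert and delete in different orders and diverge:

      # Two replicas of an RDF graph modeled as plain sets of triples.
      r1, r2 = set(), set()
      t = ("a", "b", "c")

      # Replica 1 receives Insert, then Delete.
      r1.add(t)
      r1.discard(t)

      # Replica 2 receives the same operations in the opposite order:
      # Delete (a no-op on the empty set), then Insert.
      r2.discard(t)
      r2.add(t)

      print(r1)  # set()                -> the triple is gone
      print(r2)  # {('a', 'b', 'c')}    -> the triple survives
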
  • 14. Keeping Consistent. Consistency = convergence + intention preservation. Strong consistency requires blocking: expensive. Eventual consistency: apply operations without blocking; if a conflict is detected, resolve it. Conflict detection and resolution can be very expensive at large network sizes. If the type we are dealing with is conflict-free, there is no need to worry about conflict detection and resolution.
  • 15. Conflict-Free Replicated Data Types¹. An abstract data type where the application of all operations over any state commutes, S ∘ opᵢ ∘ opⱼ = S ∘ opⱼ ∘ opᵢ ⇒ convergence. Caveat 1: intention preservation should be checked separately. Caveat 2: it should be useful, by itself or by corresponding to the concurrent semantics of a useful type (set, graph…). Caveat 2.1: it should be affordable. CRDTs enable massive optimistic (eventual) replication.
    ¹ M. Shapiro, N. Preguiça, C. Baquero, M. Zawirski. Conflict-free Replicated Data Types. SSS 2011.
  • 16. Enabling LLD with CRDTs. Construct a CRDT for RDF graph stores with SPARQL Update 1.1 operations, with minimal overhead under Live Linked Data conditions: an unknown but steadily growing number of autonomous participants; no central controls (coordination, primary replicas, small core of datacenters); the "follow your change" relation induces the network. Dynamicity parameters: degree of change / change frequency → varies between producers; growth rate → positive, knowledge grows.
  • 17. RDF Datasets. Assume there are pairwise disjoint infinite sets I, B, and L (IRIs, blank nodes, and literals, respectively). A triple (s, p, o) ∈ (I∪B) × I × (I∪B∪L) is called an RDF triple; s is the subject, p the predicate, and o the object¹. An RDF graph is a set of RDF triples¹. An RDF dataset is a set { G, (<u1>, G1), (<u2>, G2), …, (<un>, Gn) } where G and each Gi are graphs, and each <ui> is a distinct IRI. G is called the "default graph" and the pairs (<ui>, Gi) are called "named graphs"².
    ¹ J. Pérez, M. Arenas and C. Gutiérrez, Semantics and Complexity of SPARQL, ACM Transactions on Database Systems, 2009.
    ² http://www.w3.org/TR/sparql11-query/#sparqlDataset
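
    As a reading aid only (the names below are mine, not the paper's), these definitions transcribe directly into Python shapes: a triple is a 3-tuple, a graph is a set of triples, and a dataset maps the default graph and named-graph IRIs to graphs:

      # RDF triple: (s, p, o); RDF graph: set of triples; RDF dataset:
      # default graph (key None) plus named graphs keyed by their IRI.
      dataset = {
          None: {("ex:book1", "dc:title", '"Fundamentals of Compiler Design"')},
          "http://example/bookStore": {("ex:book1", "ns:price", "42")},
      }
      assert all(len(t) == 3 for g in dataset.values() for t in g)
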
  • 18. SPARQL Update Queries (syntax → formal operation):
      INSERT DATA QuadData → Dataset-UNION(GS, Dataset(QuadData, {}, GS, GS))
      DELETE DATA QuadData → Dataset-DIFF(GS, Dataset(QuadData, {}, GS, GS))
      DELETE QuadPatternDEL INSERT QuadPatternINS WHERE GroupGraphPattern → Dataset-UNION(Dataset-DIFF(GS, Dataset(QuadPatternDEL, GroupGraphPattern, DS, GS)), Dataset(QuadPatternINS, GroupGraphPattern, DS, GS))
      DELETE QuadPatternDEL WHERE GroupGraphPattern → Dataset-UNION(Dataset-DIFF(GS, Dataset(QuadPatternDEL, GroupGraphPattern, DS, GS)), Dataset({}, GroupGraphPattern, DS, GS))
      INSERT QuadPatternINS UsingClause* WHERE GroupGraphPattern → Dataset-UNION(Dataset-DIFF(GS, Dataset({}, GroupGraphPattern, DS, GS)), Dataset(QuadPatternINS, GroupGraphPattern, DS, GS))
      LOAD (SILENT)? IRIref → Dataset-UNION(GS, { graph(documentIRI) })
      CLEAR (SILENT)? DEFAULT → {{}} ∪ {(iri_i, G_i) | 1 ≤ i ≤ n}
      CREATE (SILENT)? IRIref → GS ∪ {(iri, {})} if iri not in graphNames(GS); otherwise OpCreate(GS, iri) = GS
      DROP (SILENT)? IRIref → GS if iri not in graphNames(GS); otherwise OpDrop(GS, iri_j) = {DG} ∪ {(iri_i, G_i) | i ≠ j and 1 ≤ i ≤ n}
    http://www.w3.org/TR/sparql11-update/#formalModel
  • 19. Insert Ground Triples.
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      PREFIX ns: <http://example.org/ns#>
      INSERT DATA { GRAPH <http://example/bookStore> {
        <http://example/book1> ns:price 42 .
        <http://example/book1> dc:creator "A.N.Other" . } }
    Data before:
      # Graph: http://example/bookStore
      @prefix dc: <http://purl.org/dc/elements/1.1/> .
      <http://example/book1> dc:title "Fundamentals of Compiler Design" .
    Data after:
      # Graph: http://example/bookStore
      @prefix dc: <http://purl.org/dc/elements/1.1/> .
      @prefix ns: <http://example.org/ns#> .
      <http://example/book1> dc:title "Fundamentals of Compiler Design" .
      <http://example/book1> ns:price 42 .
      <http://example/book1> dc:creator "A.N.Other" .
    http://www.w3.org/TR/sparql11-update/
  • 20. Delete-Insert.
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      WITH <http://example/addresses>
      DELETE { ?person foaf:givenName 'Bill' }
      INSERT { ?person foaf:givenName 'William' }
      WHERE  { ?person foaf:givenName 'Bill' }
    Data before:
      # Graph: http://example/addresses
      @prefix foaf: <http://xmlns.com/foaf/0.1/> .
      <http://example/president25> foaf:givenName "Bill" .
      <http://example/president25> foaf:familyName "McKinley" .
      <http://example/president27> foaf:givenName "Bill" .
      <http://example/president27> foaf:familyName "Taft" .
      <http://example/president42> foaf:givenName "Bill" .
      <http://example/president42> foaf:familyName "Clinton" .
    Data after:
      # Graph: http://example/addresses
      @prefix foaf: <http://xmlns.com/foaf/0.1/> .
      <http://example/president25> foaf:givenName "William" .
      <http://example/president25> foaf:familyName "McKinley" .
      <http://example/president27> foaf:givenName "William" .
      <http://example/president27> foaf:familyName "Taft" .
      <http://example/president42> foaf:givenName "William" .
      <http://example/president42> foaf:familyName "Clinton" .
  • 21. Enabling Intention Preservation in LLD. The observed effect of a SPARQL Update query at generation time should be preserved at re-execution time: if a query inserts a triple at generation time, that triple will be added at all sites, whatever the state of the store at re-execution time. Intentions can be impossible to preserve (nondeterministic operations) or impossible without more ordering constraints (re-execution-time evaluation):
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      WITH <http://example/addresses>
      DELETE { ?person foaf:givenName 'Bill' }
      INSERT { ?person foaf:givenName 'William' }
      WHERE  { ?person foaf:givenName 'Bill' }

      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      INSERT DATA { _:book1 dc:title "A new book" ;
                            dc:creator "A.N.Other" . }
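
    A small sketch of why re-execution-time evaluation breaks intention preservation (toy Python modeled on the 'Bill'/'William' update above; helper names are hypothetical): the same pattern update, re-executed at a replica whose state differs, affects triples its author never observed:

      # Pattern update: rename every foaf:givenName "Bill" to "William".
      def rename_bills(graph):
          bills = {(s, p, o) for (s, p, o) in graph
                   if p == "foaf:givenName" and o == '"Bill"'}
          graph -= bills
          graph |= {(s, p, '"William"') for (s, p, _) in bills}

      # State where the update was generated: one "Bill".
      a = {("ex:president42", "foaf:givenName", '"Bill"')}
      # State at another replica, which concurrently inserted a new "Bill".
      b = {("ex:president42", "foaf:givenName", '"Bill"'),
           ("ex:newPerson", "foaf:givenName", '"Bill"')}

      rename_bills(a)  # renames exactly what the author observed
      rename_bills(b)  # also renames ex:newPerson, an effect never observed
      print(a, b, sep="\n")
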
  • 22. A CRDT for SPARQL Update. Rewrite SPARQL Update operations into another type that does commute, while preserving the original semantics. Graph update operations work over RDF graphs and can be expressed in terms of the basic INSERT and DELETE of ground triples. RDF graphs are sets of RDF triples, and CRDTs for sets with insert and delete already exist…
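
    A minimal sketch of that rewriting (toy matcher; assumptions: variables are strings starting with '?', and this is not Corese's API): evaluate the pattern against the local state, then ship only the resulting ground triples as DELETE DATA / INSERT DATA:

      # Rewrite a pattern DELETE into ground triples at the local state.
      def materialize(store, pattern):
          def matches(triple, pat):
              return all(p.startswith("?") or p == t
                         for t, p in zip(triple, pat))
          return {t for t in store if matches(t, pattern)}

      store = {("Girardot", "area", "302"),
               ("Girardot", "popul", "439947"),
               ("Caracas", "area", "433")}
      affected = materialize(store, ("Girardot", "?p", "?o"))
      # Broadcast DELETE DATA over 'affected' (ground triples) instead of
      # the state-dependent pattern itself.
      print(affected)
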
  • 23. SU-Set: A SPARQL Update CRDT. Same id for all triples inserted together; unicity is preserved → less expensive in communication. Need to send (triple, id) when deleting → more expensive in communication. Better for LLD, as knowledge grows…
  • 24. SU-Set
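
    The SU-Set algorithm itself appears on this slide as an image. As a reading aid, here is a minimal OR-Set-style sketch of the core idea described by the surrounding slides (one fresh id shared by all triples inserted together; deletes ship the observed (triple, id) pairs). This is my simplification, not the paper's full construction:

      import uuid

      class SUSetSketch:
          def __init__(self):
              self.pairs = set()  # set of (triple, id)

          # -- local side: generate downstream operations ----------------
          def insert(self, triples):
              op_id = uuid.uuid4()                   # one id per operation
              return ("ins", {(t, op_id) for t in triples})

          def delete(self, triples):
              observed = {(t, i) for (t, i) in self.pairs if t in triples}
              return ("del", observed)

          # -- downstream side: apply at every replica -------------------
          def apply(self, op):
              kind, payload = op
              if kind == "ins":
                  self.pairs |= payload
              else:
                  self.pairs -= payload

          def lookup(self):
              return {t for (t, _) in self.pairs}

      # A delete removes only the exact pairs its issuer observed, so a
      # concurrent re-insertion (fresh id) survives and all orders converge.
      r1, r2 = SUSetSketch(), SUSetSketch()
      op1 = r1.insert({("a", "b", "c")}); r1.apply(op1); r2.apply(op1)
      op2 = r1.delete({("a", "b", "c")}); r1.apply(op2)
      op3 = r2.insert({("a", "b", "c")}); r2.apply(op3)
      r1.apply(op3); r2.apply(op2)  # deliver the remote ops in any order
      assert r1.lookup() == r2.lookup() == {("a", "b", "c")}
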
  • 25. Implementation. SU-Set embedded into Corese, a reference implementation of RDF stores and SPARQL Update by the Wimmics team. Rewrite SPARQL Update into SU-Set, execute, and log. "Follow your change" implemented with anti-entropy over logs and version vectors.
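
    A sketch of how such anti-entropy could work (illustrative names; the slides do not detail the protocol): each site logs its operations with a per-site counter, and a follower's version vector tells the source which log entries to ship:

      def missing_ops(log, follower_vv):
          """log: list of (site, counter, op); follower_vv: {site: counter}."""
          return [(s, c, op) for (s, c, op) in log
                  if c > follower_vv.get(s, 0)]

      def merge_vv(vv, entries):
          for (s, c, _) in entries:
              vv[s] = max(vv.get(s, 0), c)
          return vv

      source_log = [("dbpedia", 1, "ins t1"), ("dbpedia", 2, "del t1"),
                    ("odv", 1, "ins t2")]
      follower = {"dbpedia": 1}           # has seen dbpedia's first op only
      delta = missing_ops(source_log, follower)
      follower = merge_vv(follower, delta)
      print(delta)     # [('dbpedia', 2, 'del t1'), ('odv', 1, 'ins t2')]
      print(follower)  # {'dbpedia': 2, 'odv': 1}
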
  • 26. How much do we need to pay for eventual consistency? Time overhead: adding an id to each element is linear; selection and lookup are not affected by many pairs sharing the same triple. Rounds and number of messages: convergence after one round, one message per operation → optimal.
  • 27. Validation: Space Overhead. Two UUIDs of 16 bytes each = 32 bytes per triple; for 1 billion triples that is 32 GB → one iPod. Semantic stores already use an internal id → reuse it as (UUID1, UUID2). Extra pairs produced by concurrent insertions could cause problems… A version vector per site: a (monotonically increasing) counter plus a site identifier (useful for authoring too); max size = number of participants (~300-800).
  • 28. Dynamicity: DBpedia Live. A framework that registers changes in DBpedia in near real time. It generates one file with inserted triples and one with deleted triples approximately every 10 seconds. No pattern operations logged → no overhead there. Many more insertions than deletions → good for SU-Set. Many triples inserted per operation → even better for SU-Set.
  • 29. SU-Set communication overhead on DBpedia Live: 7 days of streaming, no concurrent insertions. Id size¹: 0.096 KB; average triple size¹: 0.155 KB.

      Operation     | # of Triples | Without ids (MB) | 1 id per triple (MB) | 1 id per operation (MB)
      21957 Inserts | 21,762,190   | 3294.08          | 5334.29              | 3296.6
      21957 Deletes | 1,755,888    | 265.78           | 164.61               | 430.4
      Overhead      |              |                  | 54.47%               | 4.68%

    ¹ Measured with 'wc -c' over DBpedia Live log files.
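
    The reported percentages can be sanity-checked from the table, assuming deletes ship only ids in the per-triple scheme and (triple, id) pairs in the per-operation scheme (my reading of the columns):

      base = 3294.08 + 265.78        # MB shipped without ids
      per_triple = 5334.29 + 164.61  # MB with one id per triple
      per_op = 3296.6 + 430.4        # MB with one id per operation
      print((per_triple / base - 1) * 100)  # ~54.47, as reported
      print((per_op / base - 1) * 100)      # ~4.7, close to the reported 4.68
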
  • 30. Conclusion. Live Linked Data makes Linked Data "writable" and allows a new query paradigm. SU-Set is a CRDT for RDF graphs updated with SPARQL Update 1.1 that ensures eventual consistency (convergence + intention preservation) on Live Linked Data at a reasonable cost.
  • 31. Future Work. Finish and release LLD-Corese. Write the composed CRDT for RDF datasets. What other benefits does LLD have for querying (SyncAndSearch vs. federations, connection with LAV/GAV, with adaptivity…)? Discovery? Community detection? Provenance? Traces?