Decentralized Evolution and Consolidation of RDF Graphs

Natanael Arndt and Michael Martin
June 7, 2017
ICWE 2017, Rome
LEDS (Linked Enterprise Data Services)

Introduction: Pfarrerbuch
[Figure: Pfarrerbuch setup. Components: content editor (project team) in the protected zone; stable OntoWiki instance with SPARQL endpoint, HTML GUI, and persistency layer; backup model. Labelled interactions: query, search; add, edit, maintain.]

Introduction: Catalogus Professorum Lipsiensium
[Figure: Catalogus Professorum Lipsiensium setup. Protected zone: content editor (project team); stable OntoWiki instance with SPARQL endpoint, HTML GUI, and persistency layer; backup model. Public zone: experimental OntoWiki instance with SPARQL endpoint, HTML GUI, and persistency layer; stable CPL Frontend with HTML GUI and persistency layer; experienced web user and general web user. Labelled interactions: configure; query, search; add, edit, maintain; getData; browse, annotate, discuss; browse, search; synchronize model data (SPARUL); Linked Data; partial RDF export; full RDF export.]

Introduction
• Central SPARQL endpoints
• Single Point of Failure, Unavailability
• One consolidated status of the data
• Only trusted access allowed
• Asynchronous collaboration leads to inconsistency

Introduction: From Software Engineering to Data Engineering
• In Software Engineering the term Software Crisis was coined
• Systems and problems became more and more complex
• Software Engineering methods made the process of creating software more
controllable
• Configuration Management brought Source Code Management
• CVS and SVN central systems
• Darcs, Mercurial, Git decentralized
• Git widely used (even Microsoft switched the Windows development to Git)

• Subject of collaboration are RDF graphs rather than source code files
• Consistency checks in the DSCM ecosystem are made using continuous
integration

Related Work/State of the Art
Approach      storage  quad support  bnodes    branches  merge     push/pull
TailR [4]     hybrid   no^a          yes       no^f      no        (yes)^h
Eccrev [2]    delta    yes           yes       no^f      no        no
R43ples [3]   delta    no^b,c        (yes)^d   yes       no        no
R&Wbase [5]   delta    no^c          (yes)^e   yes       (yes)^g   no
dat           chunks   n/a           n/a       no        no        yes
a: The granularity of versioning is the repository
b: Only single graphs are put under version control
c: The context is used to encode revisions
d: Blank nodes are skolemized
e: Blank nodes are addressed by internal identifiers
f: Only linear change tracking is supported
g: Naive merge implementation
h: No pull requests but history replication via Memento API

Preliminaries
• Atomic Graph
• Atomic Partition
• Difference
• Change
• Application of a Change

Preliminaries: Atomic Graph
[Figure: example graph over the nodes A, B, C, D, and E.]

Preliminaries: Atomic Partition
[Figure: the example graph over the nodes A, B, C, D, and E shown next to the smaller graphs of its atomic partition.]
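
As a rough illustration of the atomic partition, the sketch below splits a set of triples into atomic graphs. It assumes the usual reading that an atomic graph cannot be split into two non-empty graphs with disjoint blank nodes. Triples are plain tuples, blank nodes are strings starting with `_:`, and the names `is_bnode` and `atomic_partition` are hypothetical; this is not taken from the QuitStore code.

```python
from collections import defaultdict


def is_bnode(term):
    """Assumption of this sketch: blank nodes are strings starting with '_:'."""
    return isinstance(term, str) and term.startswith("_:")


def atomic_partition(graph):
    """Split a set of (s, p, o) triples into atomic graphs.

    Ground triples become singleton graphs; triples that share blank nodes,
    directly or transitively, end up in the same atomic graph.
    """
    triples = list(graph)
    parent = list(range(len(triples)))  # union-find over triple indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # blank node -> index of the first triple that uses it
    for i, (s, _p, o) in enumerate(triples):
        for term in (s, o):
            if is_bnode(term):
                if term in first_seen:
                    parent[find(i)] = find(first_seen[term])
                else:
                    first_seen[term] = i

    groups = defaultdict(set)
    for i, triple in enumerate(triples):
        groups[find(i)].add(triple)
    return {frozenset(group) for group in groups.values()}


example = {
    ("ex:A", "ex:p", "ex:B"),
    ("ex:A", "ex:q", "_:x"),
    ("_:x", "ex:r", "ex:C"),
    ("_:y", "ex:r", "ex:D"),
}
parts = atomic_partition(example)
```

In this example the two triples that mention `_:x` form one atomic graph, while the ground triple and the triple with `_:y` each form their own.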

Preliminaries: Difference
$$\Delta(G, G') := (C^+, C^-)$$
[Figure: two versions G and G' of the example graph over the nodes A, B, C, D, and E, connected by Δ.]

$$C^- := \dot{\bigcup} \left( \breve{P}\left( P(G) \setminus P(G') \right) \right)$$
[Figure: atomic partitions of the two example graphs over the nodes A, B, C, D, and E, illustrating the computation of C^-.]

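$$C^+ := \dot{\bigcup} \left( \breve{P}\left( P(G') \setminus P(G) \right) \right)$$
[Figure: atomic partitions of the two example graphs over the nodes A, B, C, D, and E, illustrating the computation of C^+.]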

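Preliminaries: Difference resp. Change
$$C^+ := \dot{\bigcup} \left( \breve{P}\left( P(G') \setminus P(G) \right) \right)$$
$$C^- := \dot{\bigcup} \left( \breve{P}\left( P(G) \setminus P(G') \right) \right)$$
$$\Delta(G, G') := (C^+, C^-)$$
[Figure: the two example graphs over the nodes A, B, C, D, and E together with the resulting graphs C^+ and C^-.]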
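
For graphs without blank nodes, and assuming that every ground triple is then its own atomic graph, the difference reduces to two plain set differences. The minimal sketch below relies on exactly that simplification; the function name `diff` and the tuple representation of triples are illustrative choices, not part of the slides.

```python
def diff(g, g_new):
    """Return the change (c_plus, c_minus) that leads from g to g_new.

    Simplifying assumption: both graphs are sets of ground triples, so
    partitioning and re-uniting atomic graphs is a no-op.
    """
    c_plus = frozenset(g_new) - frozenset(g)   # triples only in the new graph
    c_minus = frozenset(g) - frozenset(g_new)  # triples only in the old graph
    return c_plus, c_minus


g = {("ex:A", "ex:knows", "ex:B"), ("ex:A", "ex:knows", "ex:C")}
g_new = {("ex:A", "ex:knows", "ex:B"), ("ex:A", "ex:knows", "ex:D")}
c_plus, c_minus = diff(g, g_new)
```

Here `c_plus` holds the triple pointing to `ex:D` and `c_minus` the triple pointing to `ex:C`. Graphs with blank nodes would need the atomic partition first.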

Preliminaries: Application of a Change
$$\mathrm{Apl}(G, (C^+_G, C^-_G)) := \dot{\bigcup} \left( \breve{P}\left( (P(G) \setminus P(C^-_G)) \cup P(C^+_G) \right) \right)$$
[Figure: step-by-step application of a change to the example graph over the nodes A, B, C, D, and E: the atomic graphs of C^- are removed from the partition of G, the atomic graphs of C^+ are added, and the result is united into the new graph.]
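
The application of a change can be sketched in the same simplified setting, assuming graphs of ground triples represented as Python sets. The name `apply_change` is hypothetical and the code is only meant to mirror the structure of the definition: remove the deleted part, then add the inserted part.

```python
def apply_change(graph, change):
    """Apply a change (c_plus, c_minus) to a set of ground triples."""
    c_plus, c_minus = change
    return (set(graph) - set(c_minus)) | set(c_plus)


g = {("ex:A", "ex:knows", "ex:B"), ("ex:A", "ex:knows", "ex:C")}
change = (
    {("ex:A", "ex:knows", "ex:D")},  # C+: triples to add
    {("ex:A", "ex:knows", "ex:C")},  # C-: triples to remove
)
g_new = apply_change(g, change)
```

The input graph is left untouched, which makes it easy to keep both versions around for comparison.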

Operations
• Commit
• Distributed Evolution
• Merge of Two Evolved Graphs
• Revert a Commit

Operations: Commit
$$A(\{G^0\})$$
$$\mathrm{Apl}(G^0, (C^+_{G^0}, C^-_{G^0})) = G$$
$$B_{\{A\}}(\{G\})$$
$$C_{\{B_{\{A\}}\}}(\{G'\})$$
[Figure: linear commit history A, B, C.]
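
One way to picture the commit notation is a small immutable record that stores a graph and references to its parent commits. The sketch below is an assumption-laden illustration: the class `Commit` and the helper `commit` are hypothetical names, each commit holds a single graph of ground triples rather than a set of graphs, and nothing here reflects how Git or QuitStore store data.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Commit:
    name: str
    graph: FrozenSet[Triple]
    parents: Tuple["Commit", ...] = ()  # empty for the initial commit


def commit(name, parent, c_plus, c_minus):
    """Create a child commit by applying a change to the parent's graph."""
    new_graph = (parent.graph - frozenset(c_minus)) | frozenset(c_plus)
    return Commit(name, new_graph, (parent,))


t1 = ("ex:A", "ex:knows", "ex:B")
t2 = ("ex:A", "ex:knows", "ex:C")
t3 = ("ex:A", "ex:knows", "ex:D")

a = Commit("A", frozenset({t1}))              # initial commit with G^0
b = commit("B", a, c_plus={t2}, c_minus=())   # B with parent A
c = commit("C", b, c_plus={t3}, c_minus={t1}) # C with parent B
```

Following `c.parents` back to `a` reproduces the linear history A, B, C.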

Operations: Distributed Evolution
$$D_{\{B_{\{A\}}\}}(\{G''\})$$
[Figure 1: Two branches evolved from a common commit. Commits A, B, C, and D.]

Operations: Merge of Two Evolved Graphs
$$\mathrm{Merge}(C(\{G'\}), D(\{G''\})) = E_{\{C,D\}}(\{G'''\})$$
[Figure 2: Merging commits from two branches into a common version of the graph. Commits A, B, C, D, and E.]

Operations: Revert a Commit
$$\Delta^{-1}(G^0, G) = \Delta(G, G^0)$$
[Figure 3: A commit reverting the previous commit. Commits A, B, and B^{-1}.]
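
The revert rule says that the inverse difference is the difference taken in the opposite direction, which amounts to swapping the added and removed parts. The sketch below checks this on ground triples; `diff`, `apply_change`, and `invert` are hypothetical helper names and the set-based representation is a simplifying assumption.

```python
def diff(g, g_new):
    """Change (c_plus, c_minus) from g to g_new, for sets of ground triples."""
    return frozenset(g_new) - frozenset(g), frozenset(g) - frozenset(g_new)


def apply_change(graph, change):
    c_plus, c_minus = change
    return (frozenset(graph) - c_minus) | c_plus


def invert(change):
    """Swap additions and deletions to obtain the reverting change."""
    c_plus, c_minus = change
    return c_minus, c_plus


g0 = {("ex:A", "ex:p", "ex:B"), ("ex:A", "ex:p", "ex:C")}
g = {("ex:A", "ex:p", "ex:B"), ("ex:A", "ex:p", "ex:D")}

forward = diff(g0, g)
reverted = apply_change(g, invert(forward))
```

Applying the inverted change to the newer graph yields the original graph again, matching the commit B^{-1} in the figure.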

Merge Strategies
• Union Merge
• All Ours/All Theirs
• Three-Way Merge

Merge Strategies: Union Merge
[Figure: union merge. The atomic partitions of two versions of the example graph over the nodes A, B, C, D, and E are united into the merged graph.]

Merge Strategies: All Ours/All Theirs
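[Figure: All Ours/All Theirs. Two versions of the example graph over the nodes A, B, C, D, and E and the merge result; one version is crossed out.]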
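
The first two strategies can be sketched on plain sets of ground triples. The reading assumed here is that a union merge keeps everything present in either branch, while All Ours/All Theirs keeps one branch and ignores the other; the function names are hypothetical.

```python
def union_merge(ours, theirs):
    """Keep every triple that is present in at least one branch."""
    return set(ours) | set(theirs)


def all_ours(ours, theirs):
    """Keep our branch unchanged and ignore the other branch."""
    return set(ours)


def all_theirs(ours, theirs):
    """Keep their branch unchanged and ignore our branch."""
    return set(theirs)


base = {("ex:A", "ex:p", "ex:B"), ("ex:A", "ex:p", "ex:C")}
ours = base - {("ex:A", "ex:p", "ex:C")}    # we delete one triple
theirs = base | {("ex:A", "ex:p", "ex:D")}  # they add one triple

merged = union_merge(ours, theirs)
# The triple deleted in `ours` is still part of `theirs`, so it reappears.
```

The example shows the known weakness of a union merge: deletions made in only one branch are lost, which is one motivation for taking the common ancestor into account.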

Merge Strategies: Three-Way Merge
[Figure: three-way merge. Three versions of the example graph over the nodes A, B, C, D, and E are compared; the change sets are annotated as C⁺ and C⁻, and the merged graph is shown as the result.]
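
A common set-based formulation of a three-way merge computes the additions and deletions of each branch relative to the common base and applies all of them to the base. The sketch below uses that formulation on ground triples as an assumption; it does not detect or resolve conflicts, and `three_way_merge` is a hypothetical name.

```python
def three_way_merge(base, ours, theirs):
    """Merge two branches relative to their common base (sets of ground triples)."""
    base, ours, theirs = set(base), set(ours), set(theirs)
    added = (ours - base) | (theirs - base)    # C+ of both branches
    removed = (base - ours) | (base - theirs)  # C- of both branches
    return (base - removed) | added


t_b = ("ex:A", "ex:p", "ex:B")
t_c = ("ex:A", "ex:p", "ex:C")
t_d = ("ex:A", "ex:p", "ex:D")
t_e = ("ex:A", "ex:p", "ex:E")

base = {t_b, t_c}
ours = {t_b, t_d}         # removed t_c, added t_d
theirs = {t_b, t_c, t_e}  # added t_e

merged = three_way_merge(base, ours, theirs)
```

Unlike the union merge, the deletion of `t_c` in one branch survives, because the base reveals that the triple was removed rather than never present.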

Evaluation
• We have a prototypical implementation of our concepts: QuitStore
(https://github.com/AKSW/QuitStore)
• We are using the Berlin SPARQL Benchmark (BSBM) for evaluating our system
• We are using the Explore and Update Use Case since it provides SPARQL query
and update operations
• All scripts used are available at https://github.com/AKSW/QuitEval

Evaluation: Correctness of Version Tracking
• We take the Git repository, the initial data set, and the query execution log
(run.log) produced by BSBM
• We load the data into a store and execute all queries stored in the run.log
• We could verify that the state of the store was always similar to the content
of the respective Git commit

Evaluation: Correctness of Merge Method
• We take the graph generated by BSBM
• We generate two branches, each with a different random set of added and
deleted statements
• We generate a graph containing the expected result of the merge operation
• We execute the merge operation and compare the resulting graph to the
expected result
• This process was repeated 1000 times
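
The procedure above can be mimicked in a few lines. The sketch below is only an approximation under stated assumptions: synthetic triples stand in for the BSBM graph, a set-based three-way merge stands in for the merge operation under test, and the names `random_branch` and `run_trials` are hypothetical. The actual scripts are the ones in the QuitEval repository.

```python
import random


def three_way_merge(base, ours, theirs):
    """Stand-in merge under test: apply both branches' changes to the base."""
    added = (ours - base) | (theirs - base)
    removed = (base - ours) | (base - theirs)
    return (base - removed) | added


def random_branch(base, pool, rng, n_add=5, n_del=5):
    """Derive a branch from base with random additions and deletions."""
    added = set(rng.sample(sorted(pool - base), n_add))
    deleted = set(rng.sample(sorted(base), n_del))
    return (base - deleted) | added, added, deleted


def run_trials(merge, trials=1000, seed=0):
    """Return True if merge matches the expected graph in every trial."""
    rng = random.Random(seed)
    pool = {(f"ex:s{i}", "ex:p", f"ex:o{i}") for i in range(200)}
    base = set(rng.sample(sorted(pool), 100))
    for _ in range(trials):
        ours, add_a, del_a = random_branch(base, pool, rng)
        theirs, add_b, del_b = random_branch(base, pool, rng)
        # Expected result: all deletions and all additions of both branches.
        expected = (base - del_a - del_b) | add_a | add_b
        if merge(base, ours, theirs) != expected:
            return False
    return True


all_passed = run_trials(three_way_merge)
```

Passing the merge function as a parameter makes it easy to run the same randomized check against other strategies.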

Evaluation: Performance
[Figure: repository size (MiB) and memory consumption (MiB) plotted over the number of commits (0 to 4500). Series: Quit repo size, Quit with gc repo size, Quit memory, Quit with gc memory.]

Evaluation: Performance
[Figure: queries per second (qps) on a logarithmic axis from 0.01 to 1000 for the BSBM queries INSERT DATA, DELETE WHERE, Explore 1 to 5, and Explore 7 to 12. Series: quit versioning, quit versioning with gc, no versioning (baseline).]

Conclusion
• Presented a formal framework for the distributed evolution of RDF
knowledge bases
• Atomic operations on RDF graphs
• Formalized definitions of the versioning operations: commit, branch, merge,
and revert
• Quad aware, handles blank nodes, supports branches, supports merging with
conflict resolution, allows distributed collaboration with push and pull
• Merge strategies were transferred to atomic graphs so that they can be
used on RDF datasets

Future Work
• Improve our Quit Store implementation to support the complete framework
• Explore provenance tracked by Git through an RDF interface ✓ [1]
• Implement the Quit architecture for real world problems

Future Work
[Figure: planned architecture. Protected zone: content editor (project team); stable OntoWiki instance with SPARQL endpoint, HTML GUI, and persistency layer (query, search; add, edit, maintain); data transformation tasks (ETL) that add new data from legacy data sources. Public zone: any RDF editor, commenting interface, and browsing interface used by experienced and general web users (query, search; comment). Public and private data are exchanged via clone/fetch/push.]

References
[1] N. Arndt, P. Naumann, and E. Marx. Exploring the evolution and provenance of Git versioned RDF data. In J. D. Fernández, J. Debattista, and J. Umbrich, editors, 3rd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW) co-located with 14th European Semantic Web Conference (ESWC 2017), Portoroz, Slovenia, May 2017.
[2] M. Frommhold, R. N. Piris, N. Arndt, S. Tramp, N. Petersen, and M. Martin. Towards Versioning of Arbitrary RDF Data. In 12th International Conference on Semantic Systems Proceedings (SEMANTiCS 2016), SEMANTiCS '16, Leipzig, Germany, Sept. 2016.
[3] M. Graube, S. Hensel, and L. Urbas. Open semantic revision control with R43ples: Extending SPARQL to access revisions of named graphs. In Proceedings of the 12th International Conference on Semantic Systems, SEMANTiCS 2016, pages 49–56, New York, NY, USA, 2016. ACM.
[4] P. Meinhardt, M. Knuth, and H. Sack. TailR: A platform for preserving history on the web of data. In Proceedings of the 11th International Conference on Semantic Systems, SEMANTiCS '15, pages 57–64, New York, NY, USA, 2015. ACM.
[5] M. Vander Sande, P. Colpaert, R. Verborgh, S. Coppens, E. Mannens, and R. Van de Walle. R&Wbase: Git for triples. In C. Bizer, T. Heath, T. Berners-Lee, M. Hausenblas, and S. Auer, editors, LDOW, volume 996 of CEUR Workshop Proceedings. CEUR-WS.org, 2013.
