The FLuID Meta Model: Incrementally Compute Schema-level Indices for the Web of Data

www.moving-project.eu
TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovation
Till Blume
ZBW – Leibniz Information Centre for Economics
Kiel University
September 26th, 2018, DLR, Jena, Germany.

1 of 29
MOVING search scenario:
• The MOVING platform provides access to a large variety of scientific literature,
metadata, videos, social-media content, websites, …
Information Retrieval (IR) System

2 of 29

3 of 29

4 of 29

5 of 29
• Additional metadata from the Web of Data is of great value:
• We use it as an additional source of information [4].
• We complement existing metadata.
• We train machine learning models to further improve the IR [5].

6 of 29
Integrating the Web of Data
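[Figure: example from the Web of Data (the book "Towards a clean air policy" by "Great Britain. Central Electricity …", with URI-1, URI-2, and URI-3 typed as bibo:Book and foaf:Agent and linked via dct:subject and dct:creator) and the interaction between the MOVING platform, the DIS, and the SLI, numbered 1 to 4.]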
Data Integration Service (DIS) for the Web of Data:
1. We formulate a structural query to find matching databases (e.g. containing
bibliographic metadata).
2. The Schema-level Index returns a list of matching databases.
3. Access all databases and check the contained data instances.
4. Harvest the relevant data instances and integrate them into our database.

7 of 29
Problem Statement
• The Web of Data is a huge dynamic heterogeneous network:
• Huge: the largest collection of the Web of Data has more than 38 billion edges¹.
• Dynamic: 40-50% of the data changes either frequently or infrequently [10].
• Heterogeneous: no central authority responsible for data management.
• How to efficiently update schema-level indices over time?
• Re-computing from scratch is not feasible! (Crawls take weeks and
computations take days.)
• Incremental instance-level graph indices [8,13,14,18,19] not applicable due
to different level of abstraction.
• Incremental schema discovery in NoSQL databases [7,17] not applicable due
to the decentralized nature of the Web of Data.
¹ http://lodlaundromat.org/

8 of 29
The Meta Model FLuID
• Various index models exist [1,3,6,7,9,11,12,15,16].
• Our previous work [2]: the meta model FLuID, which acts as a blueprint for index models.
• Our ongoing work: extend to incremental index computation.
Modeling layers:
• M2: Meta model (FLuID)
• M1: Model (e.g., Index Model A: Characteristic Sets), <<instanceOf>> M2
• M0: Implementation (Index 1, index computation over the data graph), <<instanceOf>> M1

9 of 29
Approach
Goal: Develop an incremental schema-level index
• We extend our previously developed meta model FLuID, which can define
arbitrary schema-level indices [2].
• We analyze the effect of different types of changes to the data graph on
schema-level indices.
• We analyze the best- and worst-case complexity for all types of updates.
• We outline an algorithm using an additional data structure to coordinate the
updates efficiently.
• We experimentally evaluate the expected size of the additional data
structure.
Impact:
• Ease analyses of how data instances evolve at the schema level.
• Allow "always up-to-date" data caches at the schema level.

10 of 29
• FLuID provides 4 schema elements:
• 3 simple elements: Object Cluster (OC), Property Cluster (PC), and Property-Object Cluster (POC)
• 1 Complex Schema Element (CSE)
• FLuID provides 5 parameterizations:
• Label parameterization
• Chaining parameterization
• Direction parameterization
• Ontology parameterization
• Instance parameterization
• In total, FLuID provides 9 building blocks sufficient to model all
existing approaches and beyond.
The FLuID Meta Model

11 of 29
• Instances: edges <s,p,o> with same subject node s, i.e.,
((i1, p1, o1), (i2, p2, o2)) ∈ I ⇔ i1 = i2.
• Edges belong to exactly 1 instance, nodes not necessarily
• Since instances partition the data graph, a set of instances also partitions the
data graph.
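The instance definition above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation; the function name `group_into_instances` and the toy edges are assumptions made for the example.

```python
from collections import defaultdict


def group_into_instances(edges):
    """Partition edges (s, p, o) by subject: one instance per subject node."""
    instances = defaultdict(set)
    for s, p, o in edges:
        # Every edge lands in exactly one instance, the one of its subject.
        instances[s].add((p, o))
    return dict(instances)


# Toy data graph (hypothetical): i2 is an object of two different instances.
edges = [("i1", "p1", "i2"), ("i1", "p2", "i3"), ("i4", "p1", "i2")]
instances = group_into_instances(edges)
```

Note that the node `i2` appears as an object in two instances, while each edge belongs to exactly one of them, which matches the remark that nodes are not necessarily exclusive to one instance.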
FLuID: Equivalence Relation Approach
[Figure: example data graph with nodes i1 to i10 connected by edges labeled p1, p2, and p3.]

12 of 29
• Object Cluster: summarize instances that share a set of connected objects, i.e.,
([i1]_I, [i2]_I) ∈ OC ⇔ ∀(i1, p1, o1) ∃(i2, p2, o2): o1 = o2 ∧
∀(i2, p2, o2) ∃(i1, p1, o1): o1 = o2
The FLuID Model

13 of 29
• Label Parameterized Object Cluster: summarize instances that have the same set of
connected objects, considering only edges whose property is p1
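The object cluster and its label parameterization can be sketched as a grouping by object sets. This is a minimal sketch under assumptions: the function name `object_clusters`, its `label` argument, and the toy instances are hypothetical and not part of FLuID's formal definition.

```python
from collections import defaultdict


def object_clusters(instances, label=None):
    """Group instance ids that share the same set of connected objects.

    instances: dict mapping a subject to a set of (property, object) pairs.
    label: if given, only edges with this property are considered.
    """
    clusters = defaultdict(set)
    for subject, edges in instances.items():
        objects = frozenset(o for p, o in edges if label is None or p == label)
        clusters[objects].add(subject)
    return dict(clusters)


# Toy instances (hypothetical).
instances = {
    "i1": {("p1", "a"), ("p2", "b")},
    "i2": {("p1", "a"), ("p3", "c")},
    "i3": {("p2", "a"), ("p1", "b")},
}
plain = object_clusters(instances)
only_p1 = object_clusters(instances, label="p1")
```

Without a label, i1 and i3 fall into one cluster because both connect to {a, b}; restricted to p1, i1 and i2 are grouped instead because both reach only a via p1.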
The FLuID Model

14 of 29
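• Label Parameterized Object Cluster: summarize instances that have the same set of
connected objects, considering only edges whose property is rdf:type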
The FLuID Model
[Figure: example data graph with nodes i1, i2, i4, i5, i6, i8, and i10, edges labeled p2, p3, and rdf:type, and the types bibo:Book, foaf:Agent, and bibo:Proceedings.]

15 of 29
• Ontology parameterization: RDFS Schema Graph
The FLuID Model
[Figure: example data graph (nodes i1, i2, i4, i5, i6, i8, i10; edges p2, p3, rdf:type) with the types bibo:Proceedings, bibo:Book, and foaf:Agent.]

16 of 29
• Ontology parameterization: RDFS Schema Graph
• Instance parameterization: owl:sameAs
The FLuID Model
[Figure: example data graph (nodes i1, i2, i4, i5, i6, i8, i10) with dct:creator, rdf:type, and owl:sameAs edges and the types bibo:Proceedings, bibo:Book, and foaf:Agent.]

17 of 29
How to compute a Schema-level Index?
Define an index model using the meta model FLuID [2]:
1. Characteristic Sets [12]:
• Instances have the same incoming and outgoing PROPERTIES
Char.Sets:= u-PROPS
2. TermPicker [15]:
• Instances have the same TYPES
• Instances have the same PROPERTIES
• Neighboring Instances have the same TYPES
TermPicker:= (TYPES ∩ PROPS, T, TYPES)
Simple Schema Elements
Complex Schema Elements
Simple Schema Element
* 2
(Chaining Parameterization)
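One loose reading of the TermPicker definition above is that two instances are summarized together when their own types, their own properties, and the types of their neighbors all agree. The sketch below follows that reading; the helper names, the restriction to outgoing edges, and the toy graph are assumptions, not the formal FLuID semantics.

```python
RDF_TYPE = "rdf:type"


def types_of(node, edges):
    """TYPES: the set of rdf:type objects of a node."""
    return frozenset(o for s, p, o in edges if s == node and p == RDF_TYPE)


def props_of(node, edges):
    """PROPS: outgoing property labels of a node (rdf:type excluded here)."""
    return frozenset(p for s, p, o in edges if s == node and p != RDF_TYPE)


def termpicker_like_key(node, edges):
    """Own types, own properties, and the type sets of direct neighbors."""
    neighbours = {o for s, p, o in edges if s == node and p != RDF_TYPE}
    neighbour_types = frozenset(types_of(n, edges) for n in neighbours)
    return (types_of(node, edges), props_of(node, edges), neighbour_types)


# Toy graph (hypothetical): x1 has no type, so b3 differs from b1 and b2.
edges = [
    ("b1", RDF_TYPE, "bibo:Book"), ("b1", "dct:creator", "a1"),
    ("b2", RDF_TYPE, "bibo:Book"), ("b2", "dct:creator", "a2"),
    ("b3", RDF_TYPE, "bibo:Book"), ("b3", "dct:creator", "x1"),
    ("a1", RDF_TYPE, "foaf:Agent"), ("a2", RDF_TYPE, "foaf:Agent"),
]
```

Here b1 and b2 receive the same key, while b3 shares their types and properties but differs in the neighbor types. A grouping by `props_of` alone would put all three books together, which is closer in spirit to a property-only index, except that this sketch ignores incoming properties.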

18 of 29
Computing a Schema-level Index
Example index computation for TermPicker:
[Figure: data graph (instance information) with instances i1 to i6 of types Book, Subject, and Person, connected by author and keyword edges; schema-level index (schema and payload) with the schema elements pcrel-2, octype-1, octype-2, and cse-1.]
Index model (TYPES ∩ PROPS, T, TYPES):
• τ: No. of complex schema elements (1)
• ς: No. of simple schema elements (2)
• k: Chaining parameter (1)
Data graph:
• v: No. of nodes (6)
Index size:
• α: No. of simple schema elements (3)
• β: No. of complex schema elements (1)
Upper bound for size (the right-hand side is also the computational complexity for (re-)computing from scratch):
α + β ≤ v * (ς + τ * k)
4 < 18

19 of 29
Updating a Schema-level Index
Example index computation for TermPicker:
[Figure: the data graph (instance information) with instances i1 to i6 of types Book, Subject, and Person, connected by author and keyword edges, and the schema-level index (schema and payload), which now contains octype-3 and cse-2 in addition to pcrel-2, octype-1, octype-2, and cse-1.]
• Removing instance information can increase the index size!
Index size:
• α: No. of simple schema elements (4)
• β: No. of complex schema elements (2)

20 of 29
Possible Update Operations
There are six cases of updates possible for an SLI:
1. a new instance is observed with a new schema (SEnew)
2. a new instance is observed with a known schema (PEadd)
3. an instance is observed with a changed schema (SEmod)
4. an instance is observed with only changed instance information (PEmod)
5. an instance no longer exists (PEdel)
6. no more instance with a specific schema exists (SEdel)
We analyzed for each update type the best and worst case time complexity of
the update operation on the schema graph.
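The six cases can be told apart by comparing an observation with the previously indexed state. The following is a minimal sketch, assuming a hypothetical `index` dictionary that maps each known instance to its last (schema, payload) pair; a deleted instance is signalled by `schema=None`. Returning SEdel instead of PEdel when the last instance of a schema disappears is a simplification of this sketch.

```python
def classify_update(instance, schema, payload, index):
    """Return the update case for one observation, or None if nothing changed."""
    previous = index.get(instance)
    if schema is None:  # the instance no longer exists
        if previous is None:
            return None
        others = [i for i, (s, _) in index.items()
                  if s == previous[0] and i != instance]
        return "PEdel" if others else "SEdel"
    if previous is None:  # a new instance
        known_schemas = {s for s, _ in index.values()}
        return "PEadd" if schema in known_schemas else "SEnew"
    if previous[0] != schema:
        return "SEmod"
    if previous[1] != payload:
        return "PEmod"
    return None


# Hypothetical indexed state: two instances share schema s1.
index = {"i1": ("s1", "ds1"), "i2": ("s1", "ds1"), "i3": ("s2", "ds2")}
```

For example, `classify_update("i4", "s3", "ds1", index)` yields SEnew because schema s3 is unknown, while removing i3 yields SEdel because no other instance uses s2.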
Overall Maximum Update Complexity:
ς + τ * k + δ⁻(I^1_{max_k})^k * τ
where
• ς + τ * k: No. of simple + complex schema elements in the index model
• δ⁻(I^1_{max_k}): in-degree of instances found in the data graph
• exponent k: chaining parameter k in the index model

21 of 29
Incremental Schema-level Index
• Base algorithm iterates over all instances and computes the schema according
to the Index Model.
• Incremental algorithm uses a partial history of the indexed data graph to
coordinate the updates – the Update Coordinator (UC)
procedure computeSchema(instance)                              ▷ O(v)
    for all SchemaElement in IndexModel do                     ▷ O(ς + τ * k)
        hash(se_prev) ⟵ UC.previousElement(instance)
        se_new ⟵ extractSchema(instance, SchemaElement)       ▷ O(|instance|)
        if hash(se_prev) ≠ hash(se_new) then
            se_prev ⟵ retrieveFromSLI(hash(se_prev))
            updateSchemaElement(se_prev, se_new)               ▷ O(|instance|)
            parentInstances ⟵ UC.getParentInstances(instance)
            for all pInstance in parentInstances do            ▷ O(δ⁻ * ς + τ * k)
                computeSchema(pInstance)
        storeInSLI(se_new)
        UC.put(instance, se_new)
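The pseudocode above can be turned into a small runnable sketch. Everything here is an assumption made for illustration: the toy graph layout, the single TermPicker-like schema per instance, the `UpdateCoordinator` class, and the choice to record the new hash before recursing into parents (which keeps the recursion finite on cyclic graphs). It is not the authors' implementation.

```python
import hashlib


class UpdateCoordinator:
    """Hypothetical minimal UC: last schema hash and parent links per instance."""

    def __init__(self, graph):
        self.hash_of = {}
        self.parents_of = {}
        for source, node in graph.items():
            for target in node["links"].values():
                self.parents_of.setdefault(target, set()).add(source)


def extract_schema(instance, graph):
    """Own types, own properties, and neighbor types, as sorted tuples."""
    node = graph[instance]
    neighbour_types = sorted(tuple(sorted(graph[t]["types"]))
                             for t in node["links"].values() if t in graph)
    return (tuple(sorted(node["types"])), tuple(sorted(node["links"])),
            tuple(neighbour_types))


def digest(schema):
    return hashlib.sha1(repr(schema).encode("utf-8")).hexdigest()


def compute_schema(instance, graph, sli, uc):
    """Update the SLI for one instance; return how many instances changed."""
    schema = extract_schema(instance, graph)
    new_hash, old_hash = digest(schema), uc.hash_of.get(instance)
    if new_hash == old_hash:
        return 0  # schema unchanged, nothing to propagate
    if old_hash in sli:  # detach the instance from its previous schema element
        sli[old_hash]["instances"].discard(instance)
        if not sli[old_hash]["instances"]:
            del sli[old_hash]
    element = sli.setdefault(new_hash, {"schema": schema, "instances": set()})
    element["instances"].add(instance)
    uc.hash_of[instance] = new_hash
    touched = 1
    for parent in uc.parents_of.get(instance, ()):  # parents may depend on us
        touched += compute_schema(parent, graph, sli, uc)
    return touched


graph = {
    "book1": {"types": {"bibo:Book"}, "links": {"dct:creator": "agent1"}},
    "agent1": {"types": {"foaf:Agent"}, "links": {}},
}
sli, uc = {}, UpdateCoordinator(graph)
for name in graph:
    compute_schema(name, graph, sli, uc)

graph["agent1"]["types"] = {"foaf:Person"}  # a change on the instance level
touched = compute_schema("agent1", graph, sli, uc)
```

After the type of agent1 changes, only agent1 and its parent book1 are recomputed; a second call for agent1 finds an unchanged hash and stops immediately.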

22 of 29
Update Coordinator
Store a partial history of the indexed data graph in a 3-layered data structure to
cope with the additional level of abstraction:
• Schema Element hashes (e.g., SE-URI-12, SE-URI-13, SE-URI-55, …): for each
instance, one type of Schema Element is linked.
• Instance Element URIs (e.g., I-URI-67, I-URI-68, I-URI-88, I-URI-89, …): each
instance can have dependencies on zero or more instances.
• Data Source URIs + Timestamps (e.g., DS-URI-1, DS-URI-2, DS-URI-3, …, each with a
<<timestamp>>): each instance is defined in at least one data source.
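The three layers can be pictured as three mappings keyed by instance URI. The sketch below is an assumption about one possible layout, not the data structure used by the authors; the class name `ThreeLayerStore`, the `observe` method, and the string timestamps are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ThreeLayerStore:
    # Layer 1: instance URI -> hash of its schema element
    schema_hash_of: dict = field(default_factory=dict)
    # Layer 2: instance URI -> URIs of instances that depend on it (parents)
    parents_of: dict = field(default_factory=dict)
    # Layer 3: instance URI -> {data source URI: timestamp of last observation}
    sources_of: dict = field(default_factory=dict)

    def observe(self, instance, schema_hash, source, timestamp, parents=()):
        """Record one observation of an instance in one data source."""
        self.schema_hash_of[instance] = schema_hash
        self.parents_of.setdefault(instance, set()).update(parents)
        self.sources_of.setdefault(instance, {})[source] = timestamp


store = ThreeLayerStore()
store.observe("I-URI-67", "SE-URI-12", "DS-URI-1", "2018-09-01",
              parents=["I-URI-88"])
store.observe("I-URI-67", "SE-URI-12", "DS-URI-2", "2018-09-08")
```

Keeping the data sources per instance makes it possible to tell whether an instance that vanished from one source is still defined in another one before it is removed from the index.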

23 of 29
Preliminary Results & Discussion
• Datasets: 4 snapshots of the Web of Data crawled by the Dynamic Linked Data
Observatory (DyLDO¹), each with about 7 million data instances
• Analyze the datasets with respect to the 3-layered data structure:
• About 28% of the data instances are defined in more than 1 (average 2.2)
data sources.
• About 33% of the instances have dependencies on 1 or more (average 5)
parent instances (in-degree).
Implications for the additional space complexity on DyLDO datasets:
• SLIs using a complex schema element require storing 33% more data.
• SLIs using only simple schema elements require storing 15% more data.
Implications for the update time complexity on DyLDO datasets:
• Updates: ς + τ * k + 5 * (ς + τ * k) (TermPicker: 18)
• Re-computation: v * (ς + τ * k) (TermPicker: 21,000,000)
¹ http://swse.deri.org/dyldo/
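The two TermPicker figures above follow directly from the formulas. As a worked check, with function and parameter names chosen only for this example:

```python
def update_cost(simple, complex_, k, in_degree):
    """Cost of one update: the instance itself plus its parent instances."""
    per_instance = simple + complex_ * k
    return per_instance + in_degree * per_instance


def recomputation_cost(v, simple, complex_, k):
    """Cost of re-computing the index from scratch over v instances."""
    return v * (simple + complex_ * k)


# TermPicker: 2 simple schema elements, 1 complex schema element, k = 1.
update = update_cost(simple=2, complex_=1, k=1, in_degree=5)
recompute = recomputation_cost(v=7_000_000, simple=2, complex_=1, k=1)
```

With an average in-degree of 5 this gives 3 + 5 * 3 = 18 for a single update, against 7,000,000 * 3 = 21,000,000 for a full re-computation.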

24 of 29
Conclusion
1. We analyzed all different types of changes.
2. We outlined an algorithm that can map all instance-level changes to updates on
the schema-level Index.
3. Advantage: Only a small number of Schema Elements may need to be updated
compared to re-computing from scratch.
4. Limitation: Depending on the Index Model, we may need to store more data.
Future Work
1. Implement and empirically evaluate the performance of the incremental schema-level
index algorithm on real-world datasets (DyLDO snapshots¹) and benchmark
datasets (Berlin SPARQL Benchmark² and Lehigh University Benchmark³).
2. Reduce the data overhead by changing the Update Coordinator data structure, e.g.,
by using approximate data structures such as Bloom filters.
¹ http://swse.deri.org/dyldo/
² http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/index.html
³ http://swat.cse.lehigh.edu/projects/lubm/
Conclusion & Future Work

25 of 29
Search Engine Prototype: LODatio+
http://lodatio.informatik.uni-kiel.de

26 of 29

27 of 29

28 of 29
Ongoing Work
Data Integration Service:
1. Index the LOD cloud
2. Identify the relevant sources
3. Harvest the relevant sources
4. Crawl & Harvest additional relevant sources
[Figure: components shown are the LOD Crawler, FLuID Framework, Schema-level Index, Harvey Framework, Focused Crawler, JSON-Mapping, Query, Index, Discovery System, and the Data cloud.]

29 of 29
Thank you for your attention!
Any questions?
Project consortium and funding agency
MOVING is funded by the EU Horizon 2020 Programme under the project number INSO-4-2015: 693092

30 of 29
References
1. Benedetti, F., Bergamaschi, S., and Po, L.: Exposing the underlying schema of LOD sources. In Joint
IEEE/WIC/ACM WI and IAT, 2015.
2. Blume, T., Scherp, A.: Towards flexible indices for distributed graph data: The formal schema-level index model
FLuID. In Foundations of Databases. CEUR-WS.org, 2018.
3. Ciglan, M., Nørvåg, K., and Hluchý, L.: The SemSets model for ad-hoc semantic list search. In WWW, 2012.
4. Galke, L., Mai, F., Schelten, A., Brunsch, D., Scherp, A.: Using titles vs. full-text as source for automated semantic
document annotation. In K-CAP, 2017.
5. Galke, L., Saleh, A., Scherp, A.: Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical
Information Retrieval. In INFORMATIK, 2017.
6. Goldman, R. and Widom, J.: DataGuides: Enabling query formulation and optimization in semistructured
databases. In VLDB, 1997.
7. Gómez, S.N., Etcheverry, L., Marotta, A., Consens, M.P.: Findings from two decades of research on schema
discovery using a systematic literature review. In AMW. CEUR-WS.org, 2018.
8. Kansal, A., Spezzano, F.: A scalable graph-coarsening based index for dynamic graph databases. In CIKM, 2017.
9. Konrath, M., Gottron, T., Staab, S., and Scherp, A.: SchemEX - efficient construction of a data catalogue by
stream-based indexing of Linked Data. In J. Web Sem., 16:52–58, 2012.
10. Käfer, T., Abdelrahman, A., Umbrich, J., O’Byrne, P., Hogan, A.: Observing Linked Data dynamics. In ESWC.
Springer, 2013.
11. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., and Widom, J.: Lore: a database management system for
semistructured data. In SIGMOD Record, 26(3):54–66, 1997.

31 of 29
References
12. Neumann, T. and Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple
joins. In ICDE, 2011.
13. Qiao, M., Zhang, H., Cheng, H.: Subgraph matching: on compression and computation. In PVLDB, 11(2):176–188,
2017.
14. Sakr, S., Al-Naymat, G.: Graph indexing and querying: a review. In J. of Web Inf. Sys. 6(2):101–120, 2010.
15. Schaible, J., Gottron, T., and Scherp, A.: TermPicker: Enabling the reuse of vocabulary terms by exploiting data
from the Linked Open Data cloud. In ESWC, 2016.
16. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., and Maurino, A.: ABSTAT: ontology-driven Linked Data summaries
with pattern minimalization. In ESWC Satellite Events, Revised Selected Papers, 2016.
17. Wang, L., Zhang, S., Shi, J., Jiao, L., Hassanzadeh, O., Zou, J., Wang, C.: Schema management for document
stores. In VLDB 8(9):922–933, 2015.
18. Yuan, D., Mitra, P., Yu, H., Giles, C.L.: Iterative graph feature mining for graph indexing. In ICDE, 2012.
19. Yuan, D., Mitra, P., Yu, H., Giles, C.L.: Updating graph indices with a one-pass algorithm. In SIGMOD, 2015.
