The FLuID Meta Model: Incrementally Compute Schema-level Indices for the Web of Data
1. www.moving-project.eu
TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovation
Till Blume
ZBW – Leibniz Information Centre for Economics
Kiel University
The FLuID Meta Model: Incrementally Compute
Schema-level Indices for the Web of Data
September 26th, 2018, DLR, Jena, Germany.
2. www.moving-project.eu
MOVING search scenario:
• The MOVING platform provides access to a large variety of scientific literature,
metadata, videos, social-media content, websites, …
Information Retrieval (IR) System
• Additional metadata from the Web of Data is of great value:
• We use it as an additional source of information [4].
• We complement existing metadata.
• We train machine learning models to further improve the IR [5].
7. www.moving-project.eu
Integrating the Web of Data
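[Figure: the MOVING platform, the Data Integration Service (DIS), and the Schema-level Index (SLI), connected by the numbered steps 1 to 4 listed below. Example data: nodes URI-1, URI-2, and URI-3 with the types bibo:Book and foaf:Agent, the properties dct:creator and dct:subject, and the labels "Towards a clean air policy" and "Great Britain. Central Electricity …".]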
Data Integration Service (DIS) for the Web of Data:
1. We formulate a structural query to find matching databases (e.g. containing
bibliographic metadata).
2. The Schema-level Index returns a list of matching databases.
3. Access all databases and check the contained data instances.
4. Harvest the relevant data instances and integrate them into our database.
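Steps 1 and 2 can be pictured as a lookup in a small in-memory index. The following is a minimal sketch under stated assumptions, not the MOVING implementation: the `SchemaKey` layout, the `find_sources` helper, the property `foaf:name`, and the example.org data source URIs are all hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical schema key: (set of types, set of properties) of an instance.
SchemaKey = Tuple[FrozenSet[str], FrozenSet[str]]


def find_sources(sli: Dict[SchemaKey, Set[str]],
                 types: Set[str], props: Set[str]) -> Set[str]:
    """Return data sources whose schema elements cover the queried types and properties."""
    hits: Set[str] = set()
    for (se_types, se_props), sources in sli.items():
        if types <= se_types and props <= se_props:
            hits |= sources
    return hits


# Toy index: schema element -> data sources (payload). All URIs are made up.
sli = {
    (frozenset({"bibo:Book"}), frozenset({"dct:creator", "dct:subject"})):
        {"http://example.org/ds1"},
    (frozenset({"foaf:Agent"}), frozenset({"foaf:name"})):
        {"http://example.org/ds2"},
}

# Structural query: "sources containing books that have a creator".
matches = find_sources(sli, {"bibo:Book"}, {"dct:creator"})
print(matches)
```

Steps 3 and 4 (accessing and harvesting the sources) would then only touch the data sources returned here rather than the whole Web of Data.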
8. www.moving-project.eu
Problem Statement
• The Web of Data is a huge dynamic heterogeneous network:
• Huge: the largest collection of the Web of Data has more than 38 billion edges¹.
• Dynamic: 40-50% of the data changes either frequently or infrequently [10].
• Heterogeneous: no central authority responsible for data management.
• How can schema-level indices be updated efficiently over time?
• Re-computing from scratch is not feasible! (Crawls take weeks and
computations take days.)
• Incremental instance-level graph indices [8,13,14,18,19] are not applicable due
to the different level of abstraction.
• Incremental schema discovery in NoSQL databases [7,17] is not applicable due
to the decentralized nature of the Web of Data.
¹ http://lodlaundromat.org/
9. www.moving-project.eu
The Meta Model FLuID
[Figure: three modelling levels connected by <<instanceOf>> relations.]
• M2: Meta model FLuID (our previous work [2]); acts as blueprint.
• M1: Model, e.g. Index Model A: Characteristic Sets. Various Index Models exist [1,3,6,7,9,11,12,15,16].
• M0: Implementation, e.g. Index 1, obtained by Index Computation over the Data Graph.
• Our ongoing work: extend to incremental Index computation.
10. www.moving-project.eu
Approach
Goal: Develop an incremental schema-level index
• We extend our previously developed meta model FLuID, which can define
arbitrary schema-level indices [2].
• We analyze the effect of different types of changes on the data graph on
schema-level indices.
• We analyze the best and worst case complexity for all types of updates.
• We outline an algorithm using an additional data structure to coordinate the
updates efficiently.
• We experimentally evaluate the expected size of the additional data
structure.
Impact:
• Ease analyses of the evolution of data instances on the schema level.
• Allow "always up-to-date" data caches on the schema level.
11. www.moving-project.eu
• FLuID provides 4 schema elements:
• 3 simple elements: Object Cluster (OC), Property Cluster (PC), and Property-
Object Cluster (POC)
• 1 Complex Schema Element (CSE)
• FLuID provides 5 parameterizations:
• Label parameterization
• Chaining parameterization
• Direction parameterization
• Ontology parameterization
• Instance parameterization
• In total, FLuID provides 9 building blocks sufficient to model all
existing approaches and beyond.
The FLuID Meta Model
12. www.moving-project.eu
• Instances: edges <s,p,o> with the same subject node s, i.e.,
((i1, p1, o1), (i2, p2, o2)) ∈ I ⇔ i1 = i2.
• Edges belong to exactly 1 instance; nodes do not necessarily.
• Since instances partition the data graph, a set of instances also partitions the
data graph.
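As a minimal sketch of this definition (the `Triple` alias, the `group_instances` helper, and the toy triples are illustrative assumptions), instances can be obtained by grouping edges by their subject node:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def group_instances(triples: Iterable[Triple]) -> Dict[str, List[Triple]]:
    """Group edges by subject node; each group is one instance."""
    instances: Dict[str, List[Triple]] = defaultdict(list)
    for s, p, o in triples:
        instances[s].append((s, p, o))
    return dict(instances)


# Toy data graph: i2 is an object of two different instances.
triples = [("i1", "p1", "i2"), ("i1", "p2", "i3"), ("i4", "p1", "i2")]
instances = group_instances(triples)
print(instances)
```

Every edge lands in exactly one group, while the node i2 appears as an object in the instances of both i1 and i4.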
FLuID: Equivalence Relation Approach
[Figure: example data graph with nodes i1 to i10 and edges labeled p1, p2, and p3.]
13. www.moving-project.eu
• Object Cluster: summarize instances that share a set of connected objects, i.e.,
([i1]_I, [i2]_I) ∈ OC ⇔ (∀(i1, p1, o1) ∃(i2, p2, o2): o1 = o2) ∧
(∀(i2, p2, o2) ∃(i1, p1, o1): o1 = o2)
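Read operationally, two instances fall into the same Object Cluster when their sets of connected objects are equal. The sketch below illustrates that reading on toy data; the `object_clusters` helper and the triples are assumptions for illustration, not part of FLuID.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def object_clusters(triples: Iterable[Triple]) -> Dict[FrozenSet[str], Set[str]]:
    """Map each distinct set of connected objects to the instances that share it."""
    objects: Dict[str, Set[str]] = defaultdict(set)
    for s, _p, o in triples:
        objects[s].add(o)
    clusters: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for s, objs in objects.items():
        clusters[frozenset(objs)].add(s)
    return dict(clusters)


triples = [
    ("i1", "p1", "i2"), ("i1", "p2", "i3"),
    ("i4", "p3", "i2"), ("i4", "p1", "i3"),   # same objects as i1, different properties
    ("i5", "p1", "i2"),
]
clusters = object_clusters(triples)
print(clusters)
```

Note that i1 and i4 are equivalent here although they use different properties, because the unparameterized Object Cluster only compares objects.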
The FLuID Model
[Figure: example data graph with nodes i1 to i10 and edges labeled p1, p2, and p3.]
14. www.moving-project.eu
• Label Parameterized Object Cluster: summarize instances that share the same set of
connected objects, considering only edges whose property is p1
The FLuID Model
[Figure: example data graph with nodes i1 to i10 and edges labeled p1, p2, and p3.]
15. www.moving-project.eu
• Label Parameterized Object Cluster: summarize instances that share the same set of
connected objects, considering only edges whose property is rdf:type
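With the label fixed to rdf:type, the Object Cluster effectively groups instances by their type sets. A minimal sketch follows; the `labeled_object_clusters` helper is hypothetical, and placing instances without any rdf:type edge under the empty set is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def labeled_object_clusters(triples: Iterable[Triple],
                            label: str = "rdf:type") -> Dict[FrozenSet[str], Set[str]]:
    """Group instances by the set of objects reached via edges with the given label."""
    objects: Dict[str, Set[str]] = defaultdict(set)
    for s, p, o in triples:
        objs = objects[s]          # registers the instance even if no edge matches
        if p == label:
            objs.add(o)
    clusters: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for s, objs in objects.items():
        clusters[frozenset(objs)].add(s)
    return dict(clusters)


triples = [
    ("i1", "rdf:type", "bibo:Book"), ("i1", "dct:creator", "i2"),
    ("i4", "rdf:type", "bibo:Book"),
    ("i2", "rdf:type", "foaf:Agent"),
    ("i6", "rdf:type", "bibo:Proceedings"),
]
type_clusters = labeled_object_clusters(triples)
print(type_clusters)
```

The dct:creator edge of i1 is ignored by the parameterization, so i1 and i4 end up in the same cluster.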
The FLuID Model
[Figure: example data graph with nodes i1, i2, i4, i5, i6, i8, and i10, edges labeled p2, p3, and rdf:type, and the types bibo:Book, foaf:Agent, and bibo:Proceedings.]
16. www.moving-project.eu
• Label Parameterized Object Cluster: summarize instances that share the same set of
connected objects, considering only edges whose property is rdf:type
• Ontology parameterization: RDFS Schema Graph
The FLuID Model
[Figure: example data graph with nodes i1, i2, i4, i5, i6, i8, and i10, edges labeled p2, p3, and rdf:type, and the types bibo:Proceedings, bibo:Book, and foaf:Agent.]
17. www.moving-project.eu
• Label Parameterized Object Cluster: summarize instances that share the same set of
connected objects, considering only edges whose property is rdf:type
• Ontology parameterization: RDFS Schema Graph
• Instance parameterization: owl:sameAs
The FLuID Model
[Figure: example data graph with nodes i1, i2, i4, i5, i6, i8, and i10, edges labeled dct:creator, rdf:type, and owl:sameAs, and the types bibo:Proceedings, bibo:Book, and foaf:Agent.]
18. www.moving-project.eu
How to compute a Schema-level Index?
Define an index model using the meta model FLuID [2]:
1. Characteristic Sets [12]:
• Instances have the same incoming and outgoing PROPERTIES
Char.Sets:= u-PROPS
2. TermPicker [15]:
• Instances have the same TYPES
• Instances have the same PROPERTIES
• Neighboring Instances have the same TYPES
TermPicker:= (TYPES ∩ PROPS, T, TYPES)
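The two index models can be read as different equivalence keys computed per instance. The sketch below is a simplified interpretation of the bullet points above, not FLuID's formal definitions; the function names, the `RDF_TYPE` constant, and the toy graph are assumptions.

```python
from typing import FrozenSet, List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)
RDF_TYPE = "rdf:type"


def characteristic_set(graph: List[Triple], node: str) -> Tuple[FrozenSet[str], FrozenSet[str]]:
    """Key for Characteristic Sets: (outgoing properties, incoming properties)."""
    outgoing = frozenset(p for s, p, _o in graph if s == node)
    incoming = frozenset(p for _s, p, o in graph if o == node)
    return outgoing, incoming


def termpicker_key(graph: List[Triple], node: str):
    """Key for TermPicker: (own types, own properties, types of neighbouring instances)."""
    types = frozenset(o for s, p, o in graph if s == node and p == RDF_TYPE)
    props = frozenset(p for s, p, _o in graph if s == node and p != RDF_TYPE)
    neighbours = {o for s, p, o in graph if s == node and p != RDF_TYPE}
    neighbour_types = frozenset(
        o for s, p, o in graph if s in neighbours and p == RDF_TYPE)
    return types, props, neighbour_types


graph = [
    ("b1", RDF_TYPE, "Book"), ("b1", "author", "a1"), ("a1", RDF_TYPE, "Person"),
    ("b2", RDF_TYPE, "Book"), ("b2", "author", "a2"), ("a2", RDF_TYPE, "Person"),
]
print(termpicker_key(graph, "b1"))
print(characteristic_set(graph, "a1"))
```

Instances with equal keys would be summarized by the same schema element; here b1 and b2 share a TermPicker key, and a1 and a2 share a characteristic set.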
[Figure annotations on the definitions above: Simple Schema Elements; Complex Schema Elements; Simple Schema Element × 2 (Chaining Parameterization).]
19. www.moving-project.eu
Computing a Schema-level Index
Example Index Computation for TermPicker:
[Figure: Data Graph (Instance Information) with instances i1 to i6, the types Book, Subject, and Person, and the properties author and keyword. Schema-level Index with the schema elements oc_type-1, oc_type-2, pc_rel-2, and cse-1 (equivalences ~s and ~o), linked to the payload.]
Index Model (TYPES ∩ PROPS, T, TYPES):
• τ: No. of Complex Schema Elements (1)
• ς: No. of Simple Schema Elements (2)
• k: Chaining Parameter (1)
Data Graph:
• v: No. of nodes (6)
Index Size:
• α: No. of simple schema elements (3)
• β: No. of complex schema elements (1)
Computational complexity for (re-)computing from scratch
Upper bound for size:
α + β ≤ v * (ς + τ * k)
4 < 18
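The numbers on this slide can be checked with a few lines of arithmetic. The sketch below simply evaluates the upper bound α + β ≤ v * (ς + τ * k) with the values given for the TermPicker example; the function and parameter names are illustrative.

```python
def index_size_upper_bound(v: int, n_simple: int, n_complex: int, k: int) -> int:
    """Upper bound v * (ς + τ * k) on the number of schema elements in the index."""
    return v * (n_simple + n_complex * k)


# Values from the example: v = 6 nodes, ς = 2, τ = 1, k = 1.
bound = index_size_upper_bound(v=6, n_simple=2, n_complex=1, k=1)

# Observed index size: α = 3 simple and β = 1 complex schema elements.
actual = 3 + 1
print(actual, "<=", bound)
```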
20. www.moving-project.eu
Updating a Schema-level Index
Example Index Computation for TermPicker:
[Figure: the Data Graph (Instance Information) and Schema-level Index of the previous example after instance information has been removed. In addition to oc_type-1, oc_type-2, pc_rel-2, and cse-1, the index now contains oc_type-3 and cse-2, each linked to the payload.]
Removing instance information can increase the index size!
Index Size:
• α: No. of simple schema elements (4)
• β: No. of complex schema elements (2)
21. www.moving-project.eu
Possible Update Operations
There are six cases of updates possible for an SLI:
1. a new instance is observed with a new schema (SEnew)
2. a new instance is observed with a known schema (PEadd)
3. an instance is observed with a changed schema (SEmod)
4. an instance is observed with only changed instance information (PEmod)
5. an instance no longer exists (PEdel)
6. no more instance with a specific schema exists (SEdel)
We analyzed for each update type the best and worst case time complexity of
the update operation on the schema graph.
Overall Maximum Update Complexity:
ς + τ * k + δ⁻(I_{1..k}^{max})^k * τ
• ς, τ: No. of Simple + Complex Schema Elements in the Index Model
• δ⁻(I_{1..k}^{max}): In-degree of instances found in the data graph
• k: Chaining Parameter in the Index Model
22. www.moving-project.eu
Incremental Schema-level Index
• Base algorithm iterates over all instances and computes the schema according
to the Index Model.
• Incremental algorithm uses a partial history of the indexed data graph to
coordinate the updates – the Update Coordinator (UC)
procedure computeSchema(instance)                                ▷ O(v)
    for all SchemaElement in IndexModel do                       ▷ O(ς + τ * k)
        hash(se_prev) ⟵ UC.previousElement(instance)
        se_new ⟵ extractSchema(instance, SchemaElement)          ▷ O(|instance|)
        if hash(se_prev) ≠ hash(se_new) then
            se_prev ⟵ retrieveFromSLI(hash(se_prev))
            updateSchemaElement(se_prev, se_new)                 ▷ O(|instance|)
            parentInstances ⟵ UC.getParentInstances(instance)
            for all pInstance in parentInstances do              ▷ O(δ⁻ * ς + τ * k)
                computeSchema(pInstance)
        storeInSLI(se_new)
        UC.put(instance, se_new)
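To make the control flow concrete, here is a small runnable sketch of the same idea. It is not the authors' implementation: it uses a single toy schema element (own types plus neighbour types, i.e. k = 1), keeps the whole graph in memory, derives parent instances from the graph instead of from the Update Coordinator, and records the new schema before visiting parents so that cycles terminate. All class and function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]                      # (subject, property, object)
Schema = Tuple[FrozenSet[str], FrozenSet[str]]     # (own types, neighbour types)
RDF_TYPE = "rdf:type"


def extract_schema(graph: List[Triple], instance: str) -> Schema:
    """Toy schema element: own types plus the types of direct neighbours."""
    types = frozenset(o for s, p, o in graph if s == instance and p == RDF_TYPE)
    neighbours = {o for s, p, o in graph if s == instance and p != RDF_TYPE}
    n_types = frozenset(o for s, p, o in graph if s in neighbours and p == RDF_TYPE)
    return types, n_types


class IncrementalIndex:
    def __init__(self) -> None:
        self.sli: Dict[Schema, Set[str]] = defaultdict(set)  # schema element -> payload
        self.uc: Dict[str, Schema] = {}                      # instance -> last known schema

    def compute_schema(self, graph: List[Triple], instance: str) -> None:
        prev = self.uc.get(instance)
        new = extract_schema(graph, instance)
        if prev == new:
            return                                   # schema unchanged, nothing to do
        if prev is not None:                         # detach from the old schema element
            self.sli[prev].discard(instance)
            if not self.sli[prev]:
                del self.sli[prev]                   # no instance left with this schema
        self.sli[new].add(instance)
        self.uc[instance] = new                      # record before visiting parents
        parents = {s for s, p, o in graph if o == instance and p != RDF_TYPE}
        for parent in parents:                       # their schema may depend on this instance
            self.compute_schema(graph, parent)


graph = [("b1", RDF_TYPE, "Book"), ("b1", "author", "a1"), ("a1", RDF_TYPE, "Person")]
index = IncrementalIndex()
for inst in ("b1", "a1"):
    index.compute_schema(graph, inst)

# The type of a1 changes; only a1 is reported, its parent b1 is updated transitively.
changed = [("b1", RDF_TYPE, "Book"), ("b1", "author", "a1"), ("a1", RDF_TYPE, "Agent")]
index.compute_schema(changed, "a1")
print(index.uc["b1"])
```

In this run, updating a1 alone is enough to move b1 to a new schema element, which mirrors the recursion over parent instances in the pseudocode.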
23. www.moving-project.eu
Update Coordinator
Store a partial history for the indexed data graph in a 3-layered data structure to
cope with the additional level of abstraction:
• Schema Element hashes (e.g. SE-URI-12, SE-URI-13, SE-URI-55, …): for each instance, one type of Schema Element is linked.
• Instance Element URIs (e.g. I-URI-67, I-URI-68, I-URI-88, I-URI-89, …): each instance can have dependencies on zero or more instances.
• Data Source URIs + Timestamps (e.g. DS-URI-1, DS-URI-2, DS-URI-3, …, each with a <<timestamp>>): each instance is defined in at least one data source.
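One possible in-memory layout for the three layers is sketched below. This is only an illustration under assumptions: the field and method names are hypothetical, timestamps are stored per instance and data source, and a real Update Coordinator would need persistent storage.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class UpdateCoordinator:
    # Layer 1: instance URI -> hash of the linked schema element
    schema_of: Dict[str, str] = field(default_factory=dict)
    # Layer 2: instance URI -> parent instances that depend on it
    parents_of: Dict[str, Set[str]] = field(default_factory=dict)
    # Layer 3: instance URI -> {data source URI: timestamp of last observation}
    sources_of: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def put(self, instance: str, schema_hash: str, source: str,
            timestamp: float, parents: Iterable[str] = ()) -> None:
        """Record the current schema, data source, and dependencies of an instance."""
        self.schema_of[instance] = schema_hash
        self.sources_of.setdefault(instance, {})[source] = timestamp
        self.parents_of.setdefault(instance, set()).update(parents)

    def previous_element(self, instance: str) -> Optional[str]:
        return self.schema_of.get(instance)

    def get_parent_instances(self, instance: str) -> Set[str]:
        return set(self.parents_of.get(instance, ()))


uc = UpdateCoordinator()
uc.put("I-URI-67", "SE-URI-12", "DS-URI-1", timestamp=1.0, parents=["I-URI-88"])
uc.put("I-URI-67", "SE-URI-13", "DS-URI-2", timestamp=2.0)   # same instance, second source
print(uc.previous_element("I-URI-67"), uc.get_parent_instances("I-URI-67"))
```

An instance observed in a second data source keeps both source entries, while only one schema element hash is linked at any time.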
24. www.moving-project.eu
Preliminary Results & Discussion
• Datasets: 4 snapshots of the Web of Data crawled by the Dynamic Linked Data
Observatory (DyLDO¹), each containing about 7 million data instances
• Analysis of the datasets with respect to the 3-layered data structure:
• About 28% of the data instances are defined in more than 1 (average 2.2)
data sources.
• About 33% of the instances have dependencies on 1 or more (average 5)
parent instances (in-degree).
Implications for the additional space complexity on DyLDO datasets:
• SLIs using a complex schema element require storing 33% more data.
• SLIs using only simple schema elements require storing 15% more data.
Implications for the update time complexity on DyLDO datasets:
• Updates: ς + τ * k + 5 * (ς + τ * k) (TermPicker: 18)
• Re-computation: v * (ς + τ * k) (TermPicker: 21,000,000)
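The two TermPicker figures follow directly from the formulas. The sketch below evaluates them with ς = 2, τ = 1, k = 1 (the TermPicker index model), an average in-degree of 5, and v = 7,000,000 instances per snapshot; the function names are illustrative.

```python
def update_cost(n_simple: int, n_complex: int, k: int, in_degree: int) -> int:
    """ς + τ * k for the instance itself, plus the same cost for each parent instance."""
    base = n_simple + n_complex * k
    return base + in_degree * base


def recompute_cost(v: int, n_simple: int, n_complex: int, k: int) -> int:
    """v * (ς + τ * k) for re-computing the index from scratch."""
    return v * (n_simple + n_complex * k)


update = update_cost(n_simple=2, n_complex=1, k=1, in_degree=5)
recompute = recompute_cost(v=7_000_000, n_simple=2, n_complex=1, k=1)
print(update, recompute)
```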
¹ http://swse.deri.org/dyldo/
25. www.moving-project.eu
Conclusion
1. We analyzed all different types of changes.
2. We outlined an algorithm that can map all instance-level changes to updates on
the schema-level Index.
3. Advantage: Only a small number of Schema Elements may need to be updated
compared to re-computing from scratch.
4. Limitation: Depending on the Index Model, we may need to store more data.
Future Work
1. Implement and empirically evaluate the performance of the incremental schema-level
index algorithm on real-world datasets (DyLDO snapshots¹) and benchmark
datasets (Berlin SPARQL Benchmark² and Lehigh University Benchmark³).
2. Reduce the data overhead by changing the Update Coordinator data structure, e.g.,
by using approximate data structures like Bloom filters.
¹ http://swse.deri.org/dyldo/
² http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/index.html
³ http://swat.cse.lehigh.edu/projects/lubm/
26. www.moving-project.eu
Search Engine Prototype: LODatio+
http://lodatio.informatik.uni-kiel.de
29. www.moving-project.eu
Ongoing Work
[Figure: Data Integration Service pipeline over the Data cloud, with the components LOD Crawler, FLuID Framework, Schema-level Index, Harvey Framework, Focused Crawler, JSON-Mapping, Query, Index, and Discovery System.]
1. Index the LOD cloud
2. Identify the relevant sources
3. Harvest the relevant sources
4. Crawl & Harvest additional relevant sources
30. www.moving-project.eu
Thank you for your attention!
Any questions?
Project consortium and funding agency
MOVING is funded by the EU Horizon 2020 Programme under the project number INSO-4-2015: 693092
31. www.moving-project.eu
References
1. Benedetti, F., Bergamaschi, S., and Po, L.: Exposing the underlying schema of LOD sources. In Joint
IEEE/WIC/ACM WI and IAT, 2015.
2. Blume, T., Scherp, A.: Towards flexible indices for distributed graph data: The formal schema-level index model
FLuID. In Foundations of Databases. CEUR-WS.org, 2018.
3. Ciglan, M., Nørvåg, K., and Hluchý, L.: The SemSets model for ad-hoc semantic list search. In WWW, 2012.
4. Galke, L., Mai, F., Schelten, A., Brunsch, D., Scherp, A.: Using titles vs. full-text as source for automated semantic
document annotation. In K-CAP, 2017.
5. Galke, L., Saleh, A., Scherp, A.: Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical
Information Retrieval. In INFORMATIK, 2017.
6. Goldman, R. and Widom, J.: DataGuides: Enabling query formulation and optimization in semistructured
databases. In VLDB, 1997.
7. Gómez, S.N., Etcheverry, L., Marotta, A., Consens, M.P.: Findings from two decades of research on schema
discovery using a systematic literature review. In AMW. CEUR-WS.org, 2018.
8. Kansal, A., Spezzano, F.: A scalable graph-coarsening based index for dynamic graph databases. In CIKM, 2017.
9. Konrath, M., Gottron, T., Staab, S., and Scherp, A.: SchemEX - efficient construction of a data catalogue by
stream-based indexing of Linked Data. In J. Web Sem., 16:52–58, 2012.
10. Käfer, T., Abdelrahman, A., Umbrich, J., O’Byrne, P., Hogan, A.: Observing Linked Data dynamics. In ESWC.
Springer, 2013.
11. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., and Widom, J.: Lore: a database management system for
semistructured data. In SIGMOD Record, 26(3):54–66, 1997.
12. Neumann, T. and Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple
joins. In ICDE, 2011.
13. Qiao, M., Zhang, H., Cheng, H.: Subgraph matching: on compression and computation. In PVLDB, 11(2):176–188,
2017.
14. Sakr, S., Al-Naymat, G.: Graph indexing and querying: a review. In J. of Web Inf. Sys. 6(2):101–120, 2010.
15. Schaible, J., Gottron, T., and Scherp, A.: TermPicker: Enabling the reuse of vocabulary terms by exploiting data
from the Linked Open Data cloud. In ESWC, 2016.
16. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., and Maurino, A.: ABSTAT: ontology-driven Linked Data summaries
with pattern minimalization. In ESWC Satellite Events, Revised Selected Papers, 2016.
17. Wang, L., Zhang, S., Shi, J., Jiao, L., Hassanzadeh, O., Zou, J., Wang, C.: Schema management for document
stores. In PVLDB, 8(9):922–933, 2015.
18. Yuan, D., Mitra, P., Yu, H., Giles, C.L.: Iterative graph feature mining for graph indexing. In ICDE, 2012.
19. Yuan, D., Mitra, P., Yu, H., Giles, C.L.: Updating graph indices with a one-pass algorithm. In SIGMOD, 2015.
Editor's Notes
In the context of the European H2020 project MOVING, we developed an Information Retrieval System.
The first challenge we faced: lots of variations in the data schema. Solution: see the next slide.
There are more features available, like ontology reasoning, but they do not impact the update complexity.
For each instance one type of Schema Element is linked (Index Model)
Each instance can be defined in at least one data source (Data Graph)
Each instance can have dependencies on zero or more instances due to complex schema elements (Data Graph)