Indexing data on the web a comparison of schema level indices for data search

Indexing Data on the Web: A Comparison
of Schema-level Indices for Data Search
Till Blume1
and Ansgar Scherp2
DEXA 2020, 16th September, Virtual Event
1
Kiel University, Germany
2
Ulm University, Germany

…
Data Source Search with Graph Indices
Index
1
foaf:Agent
dct:subject bibo:Book
dct:creator
3
2
2
Towards a clean air policy
dct:creator
Great Britain. Central Electricity
foaf:Agent
URI-1 URI-2
bibo:Book
dct:subjectURI-3

Graph Indices
● General Purpose Graph indices: path, tree, and subgraph indices.
● Instance-level indices: find specific data instances, e.g., the book “Towards a
clean air policy”.
● Schema-level indices: find all data instances of type “Book”.
3

Problem Definition & Research Question
● Schema-level indices (SLIs) are designed, implemented, and evaluated for
their individual task only, using different queries, datasets, and metrics.
● Limited work has been done on understanding them in different contexts.
● Our hypothesis is that there is no SLI model that fits all tasks and
that the performance of the specific SLI depends on the specific types
of queries and characteristics of the datasets.
4

Contribution
1. We conduct an extensive empirical evaluation of representative SLI models.
To this end, we have for the first time defined and implemented the features of
existing SLI models in a common framework available on GitHub.
2. We evaluate Schema-level Indices independent of their original application for
a data search scenario.
3. We extend the Schema-level Index SchemEX [9] with RDF Schema
inferencing and owl:sameAs inferencing.
5

Computing the Schema-level Index SchemEX
RDF data graphSchema-level index
Book
i-1 i-2
Subject
i-3
Agent
p-1
p-2
Schema Element SE-1
Book
s-1 o-2
Subject
o-3
Agent Source ds-4
p-1
p-2
Data Instance Is-1
Book
s-91 o-92
Subject
o-93
Agent Source ds-7
p-1
p-2
Data Instance Is-91
6

?x
p1
pl
pl+1
pn
… …
?x
?y1
?yn
…
p1
pn
(a) Characteristic Sets [1] (c) Sem Sets [7]
p1
pl
pl+1
pn
… …
……
?x
(b) Weak Property Cliques [10]
?x
?y1
?yn
…
p1
pn
?c1
?cm
…
rdf:type
?c1
rdf:type
?cm
…
?c1
?cm
rdf:type…
(d) LODex [2], Loupe [3],
ABSTAT [5], SchemEX [9]
?yi
?c1
?cm
…?x
?c1
?cm
rdf:type… rdf:type
p1
∪ … ∪ pn
(e) Term Picker [6]
Analysis of representative SLIs
7

Experimental Setup
Implementation of a single, parameterized algorithm 1
Datasets crawled from the Web of Data:
● TimBL-11M: 11 million triples crawl starting from a single seed URI
● DyLDo-127M: 127 million triples crawl starting from 95k seed URIs
Index Models:
● Characteristic Sets, SemSets, Weak Property Cliques, SchemEX, SchemEX+U+I
● Use sub-graph sizes of k = 0, 1, and 2 hop distances for each model
1. https://github.com/t-blume/fluid-framework 8

Experiment 1: Compression and Summarization
9

Experiment 2: Stream-based Index Computation
● scalable computation by observing the graph over a stream of edges with a fixed window size
● approximation errors due to incomplete data
● evaluate the F1-score of data search queries for approximate indices compared to exact indices
● change the window size (1k, 100k, and 200k instances)
Main results: the approximation quality of an index computed in a stream-based
approach depends on three factors:
a. characteristics of the crawled dataset
b. simple queries consistently outperform complex queries
c. a larger window size typically improves the quality only marginally
10

Main Observations
1. Significant correlation between compression ratio and summarization ratio
(intuition: superior summarization -> superior compression)
2. Significant negative correlation between summarization ratio and
approximation quality (intuition: superior summarization -> superior
approximation quality)
3. Dataset characteristic is important: Over all indices, indices have on average
a 0.15 lower F1-score on the DyLDO-127M dataset compared to the
TimBL-11M dataset.
11

Conclusion
● Huge variations in compression ratio, summarization ratio, and approximation
quality for different Schema-level Indices, queries, and datasets.
● Schema-level indices developed for only a specific application can also be
beneficial in different applications, e.g., Characteristic Sets or TermPicker.
● Inferencing over RDF Schema can lead to a superior summarization ratio.
Please find on GitHub our source code, the queries, and the detailed statistics
about the TimBL-11M and DyLDO-127M datasets:
http://github.com/t-blume/fluid-framework
12

References
1. Neumann, T., Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In: ICDE. pp. 984–994.
IEEE (2011)
2. Benedetti, F., Bergamaschi, S., Po, L.: Exposing the underlying schema of LOD sources. In: Joint IEEE/WIC/ACM WI and IAT. pp. 301–304.
IEEE (2015)
3. Mihindukulasooriya, N., Poveda-Villal ́on, M., Garc ́ıa-Castro, R., G ́omez-P ́erez, A.:Loupe - an online tool for inspecting datasets in the
Linked Data cloud. In: ISWCPosters & Demos. vol. 1486. CEUR-WS.org (2015)
4. Pietriga, E., G ̈oz ̈ukan, H., Appert, C., Destandau, M., Cebiric, S., Goasdou ́e, F.,Manolescu, I.: Browsing linked data catalogs with
LODAtlas. In: ISWC. vol. 11137,pp. 137–153. Springer (2018)
5. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In:
ESWC Satellite Events, Revised Selected Papers. vol. 9989, pp. 381–395 (2016)
6. Schaible, J., Gottron, T., Scherp, A.: TermPicker: Enabling the reuse of vocabulary terms by exploiting data from the Linked Open Data
cloud. In: ESWC. vol. 9678,pp. 101–117. Springer (2016)
7. Ciglan, M., Nørv ̊ag, K., Hluch ́y, L.: The SemSets model for ad-hoc semantic list search. In: WWW. pp. 131–140. ACM (2012)
8. Gottron, T., Scherp, A., Krayer, B., Peters, A.: LODatio: using a schema-level index to support users in finding relevant sources of Linked
Data. In: K-CAP. pp.105–108. ACM (2013)
9. Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX - efficient construction of a data catalogue by stream-based indexing of Linked
Data. J. Web Sem.16,52–58 (2012)
10. Goasdou ́e, F., Guzewicz, P., Manolescu, I.: Incremental structural summarization of RDF graphs. In: EDBT. pp. 566–569.
OpenProceedings.org (2019)
13

Indexing data on the web a comparison of schema level indices for data search

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Indexing data on the web a comparison of schema level indices for data search

Similar to Indexing data on the web a comparison of schema level indices for data search (20)

Recently uploaded

Recently uploaded (20)

Indexing data on the web a comparison of schema level indices for data search