SlideShare a Scribd company logo
1 of 13
Download to read offline
Indexing Data on the Web: A Comparison
of Schema-level Indices for Data Search
Till Blume1
and Ansgar Scherp2
DEXA 2020, 16th September, Virtual Event
1
Kiel University, Germany
2
Ulm University, Germany
…
Data Source Search with Graph Indices
Index
1
foaf:Agent
dct:subject bibo:Book
dct:creator
3
2
2
Towards a clean air policy
dct:creator
Great Britain. Central Electricity
foaf:Agent
URI-1 URI-2
bibo:Book
dct:subjectURI-3
Graph Indices
● General Purpose Graph indices: path, tree, and subgraph indices.
● Instance-level indices: find specific data instances, e.g., the book “Towards a
clean air policy”.
● Schema-level indices: find all data instances of type “Book”.
3
Problem Definition & Research Question
● Schema-level indices (SLIs) are designed, implemented, and evaluated for
their individual task only, using different queries, datasets, and metrics.
● Limited work has been done on understanding them in different contexts.
● Our hypothesis is that there is no SLI model that fits all tasks and
that the performance of the specific SLI depends on the specific types
of queries and characteristics of the datasets.
4
Contribution
1. We conduct an extensive empirical evaluation of representative SLI models.
To this end, we have for the first time defined and implemented the features of
existing SLI models in a common framework available on GitHub.
2. We evaluate Schema-level Indices independent of their original application for
a data search scenario.
3. We extend the Schema-level Index SchemEX [9] with RDF Schema
inferencing and owl:sameAs inferencing.
5
Computing the Schema-level Index SchemEX
RDF data graphSchema-level index
Book
i-1 i-2
Subject
i-3
Agent
p-1
p-2
Schema Element SE-1
Book
s-1 o-2
Subject
o-3
Agent Source ds-4
p-1
p-2
Data Instance Is-1
Book
s-91 o-92
Subject
o-93
Agent Source ds-7
p-1
p-2
Data Instance Is-91
6
?x
p1
pl
pl+1
pn
… …
?x
?y1
?yn
…
p1
pn
(a) Characteristic Sets [1] (c) Sem Sets [7]
p1
pl
pl+1
pn
… …
……
?x
(b) Weak Property Cliques [10]
?x
?y1
?yn
…
p1
pn
?c1
?cm
…
rdf:type
?c1
rdf:type
?cm
…
?c1
?cm
rdf:type…
(d) LODex [2], Loupe [3],
ABSTAT [5], SchemEX [9]
?yi
?c1
?cm
…?x
?c1
?cm
rdf:type… rdf:type
p1
∪ … ∪ pn
(e) Term Picker [6]
Analysis of representative SLIs
7
Experimental Setup
Implementation of a single, parameterized algorithm 1
Datasets crawled from the Web of Data:
● TimBL-11M: 11 million triples crawl starting from a single seed URI
● DyLDo-127M: 127 million triples crawl starting from 95k seed URIs
Index Models:
● Characteristic Sets, SemSets, Weak Property Cliques, SchemEX, SchemEX+U+I
● Use sub-graph sizes of k = 0, 1, and 2 hop distances for each model
1. https://github.com/t-blume/fluid-framework 8
Experiment 1: Compression and Summarization
9
Experiment 2: Stream-based Index Computation
● scalable computation by observing the graph over a stream of edges with a fixed window size
● approximation errors due to incomplete data
● evaluate the F1-score of data search queries for approximate indices compared to exact indices
● change the window size (1k, 100k, and 200k instances)
Main results: the approximation quality of an index computed in a stream-based
approach depends on three factors:
a. characteristics of the crawled dataset
b. simple queries consistently outperform complex queries
c. a larger window size typically improves the quality only marginally
10
Main Observations
1. Significant correlation between compression ratio and summarization ratio
(intuition: superior summarization -> superior compression)
2. Significant negative correlation between summarization ratio and
approximation quality (intuition: superior summarization -> superior
approximation quality)
3. Dataset characteristic is important: Over all indices, indices have on average
a 0.15 lower F1-score on the DyLDO-127M dataset compared to the
TimBL-11M dataset.
11
Conclusion
● Huge variations in compression ratio, summarization ratio, and approximation
quality for different Schema-level Indices, queries, and datasets.
● Schema-level indices developed for only a specific application can also be
beneficial in different applications, e.g., Characteristic Sets or TermPicker.
● Inferencing over RDF Schema can lead to a superior summarization ratio.
Please find on GitHub our source code, the queries, and the detailed statistics
about the TimBL-11M and DyLDO-127M datasets:
http://github.com/t-blume/fluid-framework
12
References
1. Neumann, T., Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In: ICDE. pp. 984–994.
IEEE (2011)
2. Benedetti, F., Bergamaschi, S., Po, L.: Exposing the underlying schema of LOD sources. In: Joint IEEE/WIC/ACM WI and IAT. pp. 301–304.
IEEE (2015)
3. Mihindukulasooriya, N., Poveda-Villal ́on, M., Garc ́ıa-Castro, R., G ́omez-P ́erez, A.:Loupe - an online tool for inspecting datasets in the
Linked Data cloud. In: ISWCPosters & Demos. vol. 1486. CEUR-WS.org (2015)
4. Pietriga, E., G ̈oz ̈ukan, H., Appert, C., Destandau, M., Cebiric, S., Goasdou ́e, F.,Manolescu, I.: Browsing linked data catalogs with
LODAtlas. In: ISWC. vol. 11137,pp. 137–153. Springer (2018)
5. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In:
ESWC Satellite Events, Revised Selected Papers. vol. 9989, pp. 381–395 (2016)
6. Schaible, J., Gottron, T., Scherp, A.: TermPicker: Enabling the reuse of vocabulary terms by exploiting data from the Linked Open Data
cloud. In: ESWC. vol. 9678,pp. 101–117. Springer (2016)
7. Ciglan, M., Nørv ̊ag, K., Hluch ́y, L.: The SemSets model for ad-hoc semantic list search. In: WWW. pp. 131–140. ACM (2012)
8. Gottron, T., Scherp, A., Krayer, B., Peters, A.: LODatio: using a schema-level index to support users in finding relevant sources of Linked
Data. In: K-CAP. pp.105–108. ACM (2013)
9. Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX - efficient construction of a data catalogue by stream-based indexing of Linked
Data. J. Web Sem.16,52–58 (2012)
10. Goasdou ́e, F., Guzewicz, P., Manolescu, I.: Incremental structural summarization of RDF graphs. In: EDBT. pp. 566–569.
OpenProceedings.org (2019)
13

More Related Content

What's hot

Using parallel hierarchical clustering to
Using parallel hierarchical clustering toUsing parallel hierarchical clustering to
Using parallel hierarchical clustering toBiniam Behailu
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...Nexgen Technology
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Robert Grossman
 
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Robert Grossman
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)Robert Grossman
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Robert Grossman
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Robert Grossman
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMSMINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMSNexgen Technology
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismseSAT Journals
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesJason Hattrick-Simpers
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
Optique presentation
Optique presentationOptique presentation
Optique presentationDBOnto
 
Gray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsGray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsATMOSPHERE .
 
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar... Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...Lviv Data Science Summer School
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 

What's hot (20)

Using parallel hierarchical clustering to
Using parallel hierarchical clustering toUsing parallel hierarchical clustering to
Using parallel hierarchical clustering to
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMSMINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
 
ML in materials discovery
ML in materials discovery ML in materials discovery
ML in materials discovery
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop Slides
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Optique presentation
Optique presentationOptique presentation
Optique presentation
 
Gray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsGray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark Applications
 
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar... Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 

Similar to Indexing data on the web a comparison of schema level indices for data search

Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper PresentationShubham Singh
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSymeon Papadopoulos
 
A modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringA modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringSK Ahammad Fahad
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSotiris Beis
 
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...ssuser4b1f48
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
IRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using CobwebIRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using CobwebIRJET Journal
 
論文サーベイ(Sasaki)
論文サーベイ(Sasaki)論文サーベイ(Sasaki)
論文サーベイ(Sasaki)Hajime Sasaki
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...IJDKP
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...IJDKP
 
03 interlinking-dass
03 interlinking-dass03 interlinking-dass
03 interlinking-dassDiego Pessoa
 
Smart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient BuildingsSmart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient BuildingsAnna Fensel
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET Journal
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...Till Blume
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?Paolo Missier
 
PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.Giuseppe Ricci
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 

Similar to Indexing data on the web a comparison of schema level indices for data search (20)

Big Data and IOT
Big Data and IOTBig Data and IOT
Big Data and IOT
 
Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper Presentation
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
A modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringA modified k means algorithm for big data clustering
A modified k means algorithm for big data clustering
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
IRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using CobwebIRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using Cobweb
 
論文サーベイ(Sasaki)
論文サーベイ(Sasaki)論文サーベイ(Sasaki)
論文サーベイ(Sasaki)
 
Cognitive data
Cognitive dataCognitive data
Cognitive data
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...
 
03 interlinking-dass
03 interlinking-dass03 interlinking-dass
03 interlinking-dass
 
Smart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient BuildingsSmart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient Buildings
 
Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
 
PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 

Recently uploaded

Heads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfHeads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfbyp19971001
 
RACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxRACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxArunLakshmiMeenakshi
 
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptxPlasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptxmuralinath2
 
Quantifying Artificial Intelligence and What Comes Next!
Quantifying Artificial Intelligence and What Comes Next!Quantifying Artificial Intelligence and What Comes Next!
Quantifying Artificial Intelligence and What Comes Next!University of Hertfordshire
 
VILLAGE ATTACHMENT For rural agriculture PPT.pptx
VILLAGE ATTACHMENT For rural agriculture  PPT.pptxVILLAGE ATTACHMENT For rural agriculture  PPT.pptx
VILLAGE ATTACHMENT For rural agriculture PPT.pptxAQIBRASOOL4
 
Triploidy ...............................pptx
Triploidy ...............................pptxTriploidy ...............................pptx
Triploidy ...............................pptxCherry
 
Tuberculosis (TB)-Notes.pdf microbiology notes
Tuberculosis (TB)-Notes.pdf microbiology notesTuberculosis (TB)-Notes.pdf microbiology notes
Tuberculosis (TB)-Notes.pdf microbiology notesjyothisaisri
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
 
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...Sérgio Sacani
 
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Ansari Aashif Raza Mohd Imtiyaz
 
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 RpWASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 RpSérgio Sacani
 
ERTHROPOIESIS: Dr. E. Muralinath & R. Gnana Lahari
ERTHROPOIESIS: Dr. E. Muralinath & R. Gnana LahariERTHROPOIESIS: Dr. E. Muralinath & R. Gnana Lahari
ERTHROPOIESIS: Dr. E. Muralinath & R. Gnana Laharimuralinath2
 
Biochemistry and Biomolecules - Science - 9th Grade by Slidesgo.pptx
Biochemistry and Biomolecules - Science - 9th Grade by Slidesgo.pptxBiochemistry and Biomolecules - Science - 9th Grade by Slidesgo.pptx
Biochemistry and Biomolecules - Science - 9th Grade by Slidesgo.pptxjayabahari688
 
Abortion uae unmarried price +27791653574 Contact Us Dubai Abu Dhabi Sharjah ...
Abortion uae unmarried price +27791653574 Contact Us Dubai Abu Dhabi Sharjah ...Abortion uae unmarried price +27791653574 Contact Us Dubai Abu Dhabi Sharjah ...
Abortion uae unmarried price +27791653574 Contact Us Dubai Abu Dhabi Sharjah ...mikehavy0
 
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...yogeshlabana357357
 
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...Sahil Suleman
 
Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Fabiano Dalpiaz
 
Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...
Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...
Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...TALAPATI ARUNA CHENNA VYDYANAD
 
NUMERICAL Proof Of TIme Electron Theory.
NUMERICAL Proof Of TIme Electron Theory.NUMERICAL Proof Of TIme Electron Theory.
NUMERICAL Proof Of TIme Electron Theory.syedmuneemqadri
 
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxSaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxPat (JS) Heslop-Harrison
 

Recently uploaded (20)

Heads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfHeads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdf
 
RACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxRACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptx
 
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptxPlasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
 
Quantifying Artificial Intelligence and What Comes Next!
Quantifying Artificial Intelligence and What Comes Next!Quantifying Artificial Intelligence and What Comes Next!
Quantifying Artificial Intelligence and What Comes Next!
 
VILLAGE ATTACHMENT For rural agriculture PPT.pptx
VILLAGE ATTACHMENT For rural agriculture  PPT.pptxVILLAGE ATTACHMENT For rural agriculture  PPT.pptx
VILLAGE ATTACHMENT For rural agriculture PPT.pptx
 
Triploidy ...............................pptx
Triploidy ...............................pptxTriploidy ...............................pptx
Triploidy ...............................pptx
 
Tuberculosis (TB)-Notes.pdf microbiology notes
Tuberculosis (TB)-Notes.pdf microbiology notesTuberculosis (TB)-Notes.pdf microbiology notes
Tuberculosis (TB)-Notes.pdf microbiology notes
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...
Extensive Pollution of Uranus and Neptune’s Atmospheres by Upsweep of Icy Mat...
 
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
 
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 RpWASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
 
ERTHROPOIESIS: Dr. E. Muralinath & R. Gnana Lahari
ERTHROPOIESIS: Dr. E. Muralinath & R. Gnana LahariERTHROPOIESIS: Dr. E. Muralinath & R. Gnana Lahari
ERTHROPOIESIS: Dr. E. Muralinath & R. Gnana Lahari
 
Biochemistry and Biomolecules - Science - 9th Grade by Slidesgo.pptx
Biochemistry and Biomolecules - Science - 9th Grade by Slidesgo.pptxBiochemistry and Biomolecules - Science - 9th Grade by Slidesgo.pptx
Biochemistry and Biomolecules - Science - 9th Grade by Slidesgo.pptx
 
Abortion uae unmarried price +27791653574 Contact Us Dubai Abu Dhabi Sharjah ...
Abortion uae unmarried price +27791653574 Contact Us Dubai Abu Dhabi Sharjah ...Abortion uae unmarried price +27791653574 Contact Us Dubai Abu Dhabi Sharjah ...
Abortion uae unmarried price +27791653574 Contact Us Dubai Abu Dhabi Sharjah ...
 
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
 
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...
 
Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Information science research with large language models: between science and ...
Information science research with large language models: between science and ...
 
Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...
Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...
Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...
 
NUMERICAL Proof Of TIme Electron Theory.
NUMERICAL Proof Of TIme Electron Theory.NUMERICAL Proof Of TIme Electron Theory.
NUMERICAL Proof Of TIme Electron Theory.
 
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxSaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
 

Indexing data on the web a comparison of schema level indices for data search

  • 1. Indexing Data on the Web: A Comparison of Schema-level Indices for Data Search Till Blume1 and Ansgar Scherp2 DEXA 2020, 16th September, Virtual Event 1 Kiel University, Germany 2 Ulm University, Germany
  • 2. … Data Source Search with Graph Indices Index 1 foaf:Agent dct:subject bibo:Book dct:creator 3 2 2 Towards a clean air policy dct:creator Great Britain. Central Electricity foaf:Agent URI-1 URI-2 bibo:Book dct:subjectURI-3
  • 3. Graph Indices ● General Purpose Graph indices: path, tree, and subgraph indices. ● Instance-level indices: find specific data instances, e.g., the book “Towards a clean air policy”. ● Schema-level indices: find all data instances of type “Book”. 3
  • 4. Problem Definition & Research Question ● Schema-level indices (SLIs) are designed, implemented, and evaluated for their individual task only, using different queries, datasets, and metrics. ● Limited work has been done on understanding them in different contexts. ● Our hypothesis is that there is no SLI model that fits all tasks and that the performance of the specific SLI depends on the specific types of queries and characteristics of the datasets. 4
  • 5. Contribution 1. We conduct an extensive empirical evaluation of representative SLI models. To this end, we have for the first time defined and implemented the features of existing SLI models in a common framework available on GitHub. 2. We evaluate Schema-level Indices independent of their original application for a data search scenario. 3. We extend the Schema-level Index SchemEX [9] with RDF Schema inferencing and owl:sameAs inferencing. 5
  • 6. Computing the Schema-level Index SchemEX RDF data graphSchema-level index Book i-1 i-2 Subject i-3 Agent p-1 p-2 Schema Element SE-1 Book s-1 o-2 Subject o-3 Agent Source ds-4 p-1 p-2 Data Instance Is-1 Book s-91 o-92 Subject o-93 Agent Source ds-7 p-1 p-2 Data Instance Is-91 6
  • 7. ?x p1 pl pl+1 pn … … ?x ?y1 ?yn … p1 pn (a) Characteristic Sets [1] (c) Sem Sets [7] p1 pl pl+1 pn … … …… ?x (b) Weak Property Cliques [10] ?x ?y1 ?yn … p1 pn ?c1 ?cm … rdf:type ?c1 rdf:type ?cm … ?c1 ?cm rdf:type… (d) LODex [2], Loupe [3], ABSTAT [5], SchemEX [9] ?yi ?c1 ?cm …?x ?c1 ?cm rdf:type… rdf:type p1 ∪ … ∪ pn (e) Term Picker [6] Analysis of representative SLIs 7
  • 8. Experimental Setup Implementation of a single, parameterized algorithm 1 Datasets crawled from the Web of Data: ● TimBL-11M: 11 million triples crawl starting from a single seed URI ● DyLDo-127M: 127 million triples crawl starting from 95k seed URIs Index Models: ● Characteristic Sets, SemSets, Weak Property Cliques, SchemEX, SchemEX+U+I ● Use sub-graph sizes of k = 0, 1, and 2 hop distances for each model 1. https://github.com/t-blume/fluid-framework 8
  • 9. Experiment 1: Compression and Summarization 9
  • 10. Experiment 2: Stream-based Index Computation ● scalable computation by observing the graph over a stream of edges with a fixed window size ● approximation errors due to incomplete data ● evaluate the F1-score of data search queries for approximate indices compared to exact indices ● change the window size (1k, 100k, and 200k instances) Main results: the approximation quality of an index computed in a stream-based approach depends on three factors: a. characteristics of the crawled dataset b. simple queries consistently outperform complex queries c. a larger window size typically improves the quality only marginally 10
  • 11. Main Observations 1. Significant correlation between compression ratio and summarization ratio (intuition: superior summarization -> superior compression) 2. Significant negative correlation between summarization ratio and approximation quality (intuition: superior summarization -> superior approximation quality) 3. Dataset characteristic is important: Over all indices, indices have on average a 0.15 lower F1-score on the DyLDO-127M dataset compared to the TimBL-11M dataset. 11
  • 12. Conclusion ● Huge variations in compression ratio, summarization ratio, and approximation quality for different Schema-level Indices, queries, and datasets. ● Schema-level indices developed for only a specific application can also be beneficial in different applications, e.g., Characteristic Sets or TermPicker. ● Inferencing over RDF Schema can lead to a superior summarization ratio. Please find on GitHub our source code, the queries, and the detailed statistics about the TimBL-11M and DyLDO-127M datasets: http://github.com/t-blume/fluid-framework 12
  • 13. References 1. Neumann, T., Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In: ICDE. pp. 984–994. IEEE (2011) 2. Benedetti, F., Bergamaschi, S., Po, L.: Exposing the underlying schema of LOD sources. In: Joint IEEE/WIC/ACM WI and IAT. pp. 301–304. IEEE (2015) 3. Mihindukulasooriya, N., Poveda-Villal ́on, M., Garc ́ıa-Castro, R., G ́omez-P ́erez, A.:Loupe - an online tool for inspecting datasets in the Linked Data cloud. In: ISWCPosters & Demos. vol. 1486. CEUR-WS.org (2015) 4. Pietriga, E., G ̈oz ̈ukan, H., Appert, C., Destandau, M., Cebiric, S., Goasdou ́e, F.,Manolescu, I.: Browsing linked data catalogs with LODAtlas. In: ISWC. vol. 11137,pp. 137–153. Springer (2018) 5. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In: ESWC Satellite Events, Revised Selected Papers. vol. 9989, pp. 381–395 (2016) 6. Schaible, J., Gottron, T., Scherp, A.: TermPicker: Enabling the reuse of vocabulary terms by exploiting data from the Linked Open Data cloud. In: ESWC. vol. 9678,pp. 101–117. Springer (2016) 7. Ciglan, M., Nørv ̊ag, K., Hluch ́y, L.: The SemSets model for ad-hoc semantic list search. In: WWW. pp. 131–140. ACM (2012) 8. Gottron, T., Scherp, A., Krayer, B., Peters, A.: LODatio: using a schema-level index to support users in finding relevant sources of Linked Data. In: K-CAP. pp.105–108. ACM (2013) 9. Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX - efficient construction of a data catalogue by stream-based indexing of Linked Data. J. Web Sem.16,52–58 (2012) 10. Goasdou ́e, F., Guzewicz, P., Manolescu, I.: Incremental structural summarization of RDF graphs. In: EDBT. pp. 566–569. OpenProceedings.org (2019) 13