SlideShare a Scribd company logo
1 of 13
Download to read offline
Indexing Data on the Web: A Comparison
of Schema-level Indices for Data Search
Till Blume1
and Ansgar Scherp2
DEXA 2020, 16th September, Virtual Event
1
Kiel University, Germany
2
Ulm University, Germany
…
Data Source Search with Graph Indices
Index
1
foaf:Agent
dct:subject bibo:Book
dct:creator
3
2
2
Towards a clean air policy
dct:creator
Great Britain. Central Electricity
foaf:Agent
URI-1 URI-2
bibo:Book
dct:subjectURI-3
Graph Indices
● General Purpose Graph indices: path, tree, and subgraph indices.
● Instance-level indices: find specific data instances, e.g., the book “Towards a
clean air policy”.
● Schema-level indices: find all data instances of type “Book”.
3
Problem Definition & Research Question
● Schema-level indices (SLIs) are designed, implemented, and evaluated for
their individual task only, using different queries, datasets, and metrics.
● Limited work has been done on understanding them in different contexts.
● Our hypothesis is that there is no SLI model that fits all tasks and
that the performance of the specific SLI depends on the specific types
of queries and characteristics of the datasets.
4
Contribution
1. We conduct an extensive empirical evaluation of representative SLI models.
To this end, we have for the first time defined and implemented the features of
existing SLI models in a common framework available on GitHub.
2. We evaluate Schema-level Indices independent of their original application for
a data search scenario.
3. We extend the Schema-level Index SchemEX [9] with RDF Schema
inferencing and owl:sameAs inferencing.
5
Computing the Schema-level Index SchemEX
RDF data graphSchema-level index
Book
i-1 i-2
Subject
i-3
Agent
p-1
p-2
Schema Element SE-1
Book
s-1 o-2
Subject
o-3
Agent Source ds-4
p-1
p-2
Data Instance Is-1
Book
s-91 o-92
Subject
o-93
Agent Source ds-7
p-1
p-2
Data Instance Is-91
6
?x
p1
pl
pl+1
pn
… …
?x
?y1
?yn
…
p1
pn
(a) Characteristic Sets [1] (c) Sem Sets [7]
p1
pl
pl+1
pn
… …
……
?x
(b) Weak Property Cliques [10]
?x
?y1
?yn
…
p1
pn
?c1
?cm
…
rdf:type
?c1
rdf:type
?cm
…
?c1
?cm
rdf:type…
(d) LODex [2], Loupe [3],
ABSTAT [5], SchemEX [9]
?yi
?c1
?cm
…?x
?c1
?cm
rdf:type… rdf:type
p1
∪ … ∪ pn
(e) Term Picker [6]
Analysis of representative SLIs
7
Experimental Setup
Implementation of a single, parameterized algorithm 1
Datasets crawled from the Web of Data:
● TimBL-11M: 11 million triples crawl starting from a single seed URI
● DyLDo-127M: 127 million triples crawl starting from 95k seed URIs
Index Models:
● Characteristic Sets, SemSets, Weak Property Cliques, SchemEX, SchemEX+U+I
● Use sub-graph sizes of k = 0, 1, and 2 hop distances for each model
1. https://github.com/t-blume/fluid-framework 8
Experiment 1: Compression and Summarization
9
Experiment 2: Stream-based Index Computation
● scalable computation by observing the graph over a stream of edges with a fixed window size
● approximation errors due to incomplete data
● evaluate the F1-score of data search queries for approximate indices compared to exact indices
● change the window size (1k, 100k, and 200k instances)
Main results: the approximation quality of an index computed in a stream-based
approach depends on three factors:
a. characteristics of the crawled dataset
b. simple queries consistently outperform complex queries
c. a larger window size typically improves the quality only marginally
10
Main Observations
1. Significant correlation between compression ratio and summarization ratio
(intuition: superior summarization -> superior compression)
2. Significant negative correlation between summarization ratio and
approximation quality (intuition: superior summarization -> superior
approximation quality)
3. Dataset characteristic is important: Over all indices, indices have on average
a 0.15 lower F1-score on the DyLDO-127M dataset compared to the
TimBL-11M dataset.
11
Conclusion
● Huge variations in compression ratio, summarization ratio, and approximation
quality for different Schema-level Indices, queries, and datasets.
● Schema-level indices developed for only a specific application can also be
beneficial in different applications, e.g., Characteristic Sets or TermPicker.
● Inferencing over RDF Schema can lead to a superior summarization ratio.
Please find on GitHub our source code, the queries, and the detailed statistics
about the TimBL-11M and DyLDO-127M datasets:
http://github.com/t-blume/fluid-framework
12
References
1. Neumann, T., Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In: ICDE. pp. 984–994.
IEEE (2011)
2. Benedetti, F., Bergamaschi, S., Po, L.: Exposing the underlying schema of LOD sources. In: Joint IEEE/WIC/ACM WI and IAT. pp. 301–304.
IEEE (2015)
3. Mihindukulasooriya, N., Poveda-Villal ́on, M., Garc ́ıa-Castro, R., G ́omez-P ́erez, A.:Loupe - an online tool for inspecting datasets in the
Linked Data cloud. In: ISWCPosters & Demos. vol. 1486. CEUR-WS.org (2015)
4. Pietriga, E., G ̈oz ̈ukan, H., Appert, C., Destandau, M., Cebiric, S., Goasdou ́e, F.,Manolescu, I.: Browsing linked data catalogs with
LODAtlas. In: ISWC. vol. 11137,pp. 137–153. Springer (2018)
5. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In:
ESWC Satellite Events, Revised Selected Papers. vol. 9989, pp. 381–395 (2016)
6. Schaible, J., Gottron, T., Scherp, A.: TermPicker: Enabling the reuse of vocabulary terms by exploiting data from the Linked Open Data
cloud. In: ESWC. vol. 9678,pp. 101–117. Springer (2016)
7. Ciglan, M., Nørv ̊ag, K., Hluch ́y, L.: The SemSets model for ad-hoc semantic list search. In: WWW. pp. 131–140. ACM (2012)
8. Gottron, T., Scherp, A., Krayer, B., Peters, A.: LODatio: using a schema-level index to support users in finding relevant sources of Linked
Data. In: K-CAP. pp.105–108. ACM (2013)
9. Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX - efficient construction of a data catalogue by stream-based indexing of Linked
Data. J. Web Sem.16,52–58 (2012)
10. Goasdou ́e, F., Guzewicz, P., Manolescu, I.: Incremental structural summarization of RDF graphs. In: EDBT. pp. 566–569.
OpenProceedings.org (2019)
13

More Related Content

What's hot

Using parallel hierarchical clustering to
Using parallel hierarchical clustering toUsing parallel hierarchical clustering to
Using parallel hierarchical clustering toBiniam Behailu
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...Nexgen Technology
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Robert Grossman
 
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Robert Grossman
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)Robert Grossman
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Robert Grossman
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Robert Grossman
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMSMINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMSNexgen Technology
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismseSAT Journals
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesJason Hattrick-Simpers
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
Optique presentation
Optique presentationOptique presentation
Optique presentationDBOnto
 
Gray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsGray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsATMOSPHERE .
 
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar... Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...Lviv Data Science Summer School
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 

What's hot (20)

Using parallel hierarchical clustering to
Using parallel hierarchical clustering toUsing parallel hierarchical clustering to
Using parallel hierarchical clustering to
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMSMINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
 
ML in materials discovery
ML in materials discovery ML in materials discovery
ML in materials discovery
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop Slides
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Optique presentation
Optique presentationOptique presentation
Optique presentation
 
Gray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsGray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark Applications
 
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar... Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 

Similar to Indexing data on the web a comparison of schema level indices for data search

Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper PresentationShubham Singh
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSymeon Papadopoulos
 
A modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringA modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringSK Ahammad Fahad
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSotiris Beis
 
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...ssuser4b1f48
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
IRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using CobwebIRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using CobwebIRJET Journal
 
論文サーベイ(Sasaki)
論文サーベイ(Sasaki)論文サーベイ(Sasaki)
論文サーベイ(Sasaki)Hajime Sasaki
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...IJDKP
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...IJDKP
 
03 interlinking-dass
03 interlinking-dass03 interlinking-dass
03 interlinking-dassDiego Pessoa
 
Smart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient BuildingsSmart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient BuildingsAnna Fensel
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET Journal
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...Till Blume
 
PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.Giuseppe Ricci
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 

Similar to Indexing data on the web a comparison of schema level indices for data search (20)

Big Data and IOT
Big Data and IOTBig Data and IOT
Big Data and IOT
 
Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper Presentation
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
A modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringA modified k means algorithm for big data clustering
A modified k means algorithm for big data clustering
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
IRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using CobwebIRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using Cobweb
 
論文サーベイ(Sasaki)
論文サーベイ(Sasaki)論文サーベイ(Sasaki)
論文サーベイ(Sasaki)
 
Cognitive data
Cognitive dataCognitive data
Cognitive data
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...
 
03 interlinking-dass
03 interlinking-dass03 interlinking-dass
03 interlinking-dass
 
Smart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient BuildingsSmart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient Buildings
 
Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
 
PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 

Recently uploaded

GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxyaramohamed343013
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 

Recently uploaded (20)

GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docx
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 

Indexing data on the web a comparison of schema level indices for data search

  • 1. Indexing Data on the Web: A Comparison of Schema-level Indices for Data Search Till Blume1 and Ansgar Scherp2 DEXA 2020, 16th September, Virtual Event 1 Kiel University, Germany 2 Ulm University, Germany
  • 2. … Data Source Search with Graph Indices Index 1 foaf:Agent dct:subject bibo:Book dct:creator 3 2 2 Towards a clean air policy dct:creator Great Britain. Central Electricity foaf:Agent URI-1 URI-2 bibo:Book dct:subjectURI-3
  • 3. Graph Indices ● General Purpose Graph indices: path, tree, and subgraph indices. ● Instance-level indices: find specific data instances, e.g., the book “Towards a clean air policy”. ● Schema-level indices: find all data instances of type “Book”. 3
  • 4. Problem Definition & Research Question ● Schema-level indices (SLIs) are designed, implemented, and evaluated for their individual task only, using different queries, datasets, and metrics. ● Limited work has been done on understanding them in different contexts. ● Our hypothesis is that there is no SLI model that fits all tasks and that the performance of the specific SLI depends on the specific types of queries and characteristics of the datasets. 4
  • 5. Contribution 1. We conduct an extensive empirical evaluation of representative SLI models. To this end, we have for the first time defined and implemented the features of existing SLI models in a common framework available on GitHub. 2. We evaluate Schema-level Indices independent of their original application for a data search scenario. 3. We extend the Schema-level Index SchemEX [9] with RDF Schema inferencing and owl:sameAs inferencing. 5
  • 6. Computing the Schema-level Index SchemEX RDF data graphSchema-level index Book i-1 i-2 Subject i-3 Agent p-1 p-2 Schema Element SE-1 Book s-1 o-2 Subject o-3 Agent Source ds-4 p-1 p-2 Data Instance Is-1 Book s-91 o-92 Subject o-93 Agent Source ds-7 p-1 p-2 Data Instance Is-91 6
  • 7. ?x p1 pl pl+1 pn … … ?x ?y1 ?yn … p1 pn (a) Characteristic Sets [1] (c) Sem Sets [7] p1 pl pl+1 pn … … …… ?x (b) Weak Property Cliques [10] ?x ?y1 ?yn … p1 pn ?c1 ?cm … rdf:type ?c1 rdf:type ?cm … ?c1 ?cm rdf:type… (d) LODex [2], Loupe [3], ABSTAT [5], SchemEX [9] ?yi ?c1 ?cm …?x ?c1 ?cm rdf:type… rdf:type p1 ∪ … ∪ pn (e) Term Picker [6] Analysis of representative SLIs 7
  • 8. Experimental Setup Implementation of a single, parameterized algorithm 1 Datasets crawled from the Web of Data: ● TimBL-11M: 11 million triples crawl starting from a single seed URI ● DyLDo-127M: 127 million triples crawl starting from 95k seed URIs Index Models: ● Characteristic Sets, SemSets, Weak Property Cliques, SchemEX, SchemEX+U+I ● Use sub-graph sizes of k = 0, 1, and 2 hop distances for each model 1. https://github.com/t-blume/fluid-framework 8
  • 9. Experiment 1: Compression and Summarization 9
  • 10. Experiment 2: Stream-based Index Computation ● scalable computation by observing the graph over a stream of edges with a fixed window size ● approximation errors due to incomplete data ● evaluate the F1-score of data search queries for approximate indices compared to exact indices ● change the window size (1k, 100k, and 200k instances) Main results: the approximation quality of an index computed in a stream-based approach depends on three factors: a. characteristics of the crawled dataset b. simple queries consistently outperform complex queries c. a larger window size typically improves the quality only marginally 10
  • 11. Main Observations 1. Significant correlation between compression ratio and summarization ratio (intuition: superior summarization -> superior compression) 2. Significant negative correlation between summarization ratio and approximation quality (intuition: superior summarization -> superior approximation quality) 3. Dataset characteristic is important: Over all indices, indices have on average a 0.15 lower F1-score on the DyLDO-127M dataset compared to the TimBL-11M dataset. 11
  • 12. Conclusion ● Huge variations in compression ratio, summarization ratio, and approximation quality for different Schema-level Indices, queries, and datasets. ● Schema-level indices developed for only a specific application can also be beneficial in different applications, e.g., Characteristic Sets or TermPicker. ● Inferencing over RDF Schema can lead to a superior summarization ratio. Please find on GitHub our source code, the queries, and the detailed statistics about the TimBL-11M and DyLDO-127M datasets: http://github.com/t-blume/fluid-framework 12
  • 13. References 1. Neumann, T., Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In: ICDE. pp. 984–994. IEEE (2011) 2. Benedetti, F., Bergamaschi, S., Po, L.: Exposing the underlying schema of LOD sources. In: Joint IEEE/WIC/ACM WI and IAT. pp. 301–304. IEEE (2015) 3. Mihindukulasooriya, N., Poveda-Villal ́on, M., Garc ́ıa-Castro, R., G ́omez-P ́erez, A.:Loupe - an online tool for inspecting datasets in the Linked Data cloud. In: ISWCPosters & Demos. vol. 1486. CEUR-WS.org (2015) 4. Pietriga, E., G ̈oz ̈ukan, H., Appert, C., Destandau, M., Cebiric, S., Goasdou ́e, F.,Manolescu, I.: Browsing linked data catalogs with LODAtlas. In: ISWC. vol. 11137,pp. 137–153. Springer (2018) 5. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In: ESWC Satellite Events, Revised Selected Papers. vol. 9989, pp. 381–395 (2016) 6. Schaible, J., Gottron, T., Scherp, A.: TermPicker: Enabling the reuse of vocabulary terms by exploiting data from the Linked Open Data cloud. In: ESWC. vol. 9678,pp. 101–117. Springer (2016) 7. Ciglan, M., Nørv ̊ag, K., Hluch ́y, L.: The SemSets model for ad-hoc semantic list search. In: WWW. pp. 131–140. ACM (2012) 8. Gottron, T., Scherp, A., Krayer, B., Peters, A.: LODatio: using a schema-level index to support users in finding relevant sources of Linked Data. In: K-CAP. pp.105–108. ACM (2013) 9. Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX - efficient construction of a data catalogue by stream-based indexing of Linked Data. J. Web Sem.16,52–58 (2012) 10. Goasdou ́e, F., Guzewicz, P., Manolescu, I.: Incremental structural summarization of RDF graphs. In: EDBT. pp. 566–569. OpenProceedings.org (2019) 13