Wu, Honghan, Boris Villazon-Terrazas, Jeff Z. Pan, and Jose Manuel Gomez-Perez. “How Redundant Is It? – An Empirical Analysis on Linked Datasets.” In ISWC COLD Workshop. 2014.
http://ceur-ws.org/Vol-1264/cold2014_WuVPG.pdf
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Redundancy analysis on linked data #cold2014 #ISWC2014
1. How redundant is it? – An empirical analysis on linked datasets
Honghan Wu1, Boris Villazon-Terrazas2, Jeff Z. Pan1 and José Manuel Gómez Pérez2
University of Aberdeen1, UK
iSOCO2 , Spain
20/10/2014 1
2. 2
Content
•
What is data redundancy with linked data?
•
Why is it of special interest to linked data consumption?
•
Linked Data redundancy categorisation
•
How to analysis?
•
Dataset selection & The Result
•
Conclusion
3. 3
What is the data redundancy in LD?
•
Data Redundancy
–
[Database systems] Same piece of data in multiple places
–
[Information theory] Wasted "space" used to transmit certain data
•
(In this work)Linked Data Redundancy
–
Wasted “space” to represent certain meaning (represented in certain semantics)
–
Duplication-free
4. 4
Why is it of special interest to LD consumption?
•
Bad Redundancy & Good Redundancy
–
Bad for exchange: storage, transmission
–
Good for inference computation
•
Relevant consumption tasks
–
Hosting/Sharing
–
Query Answering (SPARQL)
–
Ontology Based Data Access
–
Reasoning
5. Redundancy in Linked Data
•
Redundancy Categorisation for RDF Data
•
Redundancies caused by the “Linked” nature
6. 6
RDF Redundancies vs. Succinct Representations
[Rule based] A. K. Joshi, P. Hitzler, and G. Dong. Logical linked data compression. In The Semantic Web: Semantics and Big Data, pages 170–184. Springer, 2013.
[HDT]J. D. FernáNdez, M. A. MartíNez-Prieto, C. GutiéRrez, A. Polleres, and M. Arias. Binary rdf representation for publication and exchange (hdt). Web Semant., 19:22–41, Mar. 2013.
[WaterFowl] O. Curé, G. Blin, D. Revuz, and D. C. Faye. Waterfowl: A compact, self-indexed and inference-enabled immutable rdf store. In The Semantic Web: Trends and Challenges, pages 302– 316. Springer, 2014.
Pan, Jeff Z., Jose Manuel Gomez-Perez, Yuan Ren, Honghan Wu, Haofen Wang and Man Zhu. “Graph Pattern based RDF Data Compression”. In Proc. of 4th Joint International Semantic Technology Conference (JIST). 2014. (To appear)
9. 9
Symbolic Redundancy
•
http://xmlns.com/foaf/0.1/name
–
31 bytes in ASCII
URI
ID (4 bytes)
…
…
http://xmlns.com/foaf/0.1/name
128
…
…
Less bytes for basic data units
-
(Fix-length)Dictionary Based
-
(Variable-length) Huffman coding
-
Predictive encoding
10. 10
Semantic Redundancy Caused by “Linked” Nature
•
Vocabulary Linkage
–
Reuse of other vocabularies: more rules
–
Less redundancy ratio: more triples derivable
–
More redundancy: co-occurrence triples removable
•
Instance Linkage
–
sameAs linkages
–
Bring in new assertions (e.g., type assertions)
–
Bring in new axioms
11. How to analysis?
•
Two dimension analysis
•
Methodology
•
Metrics
17. 17
A-Box Only: Semantic Redundancies
– Redundant Triples
– Semantic redundancy ratio, i.e.
– # Graph Patterns used to substitute redundant triples
18. 18
A-Box Only: Syntactic Redundancies
– the redundant resource occurrences of inter-structural
redundancies
– the syntactic redundancy ratio, i.e.
19. 19
A-Box & T-Box: No Linkage
DBLP2013: SWRC ontology
Ordnance Survey: official published OS ontology
1.7%
184%
108%
4.7%
20. 20
A-Box & T-Box: No Linkage
First 3 datasts are reusing FOAF Ontology
– the number of directly used terms from reused T-Box
– the number of applicable axioms from (materialised) reused T-Box
26.9%
4%
45.4%
1.3%
21. 21
Conclusion
•
LOD redundancy are heterogeneous & huge
•
Vocabulary linkage might lead to huge number of derivable triples
•
Redundancy aware techniques are demanded
22. 22
Redundancy-aware Consumption
•
Compression: different redundancies might need different techniques
•
For Data Access: (high inter-structure redundancy) skewed entity distributions over EDPs -> efficient access?
•
OBDA/Reasoning: A-Box redundancy = less T-Box axioms
•
Data Publisher: should be aware of the consequences of reusing