1. Where is my URI?
Andre Valdestilhas (1)
Tommaso Soru (1)
Markus Nentwig (2)
Edgard Marx (3)
Muhammad Saleem (1)
Axel-Cyrille Ngonga Ngomo (1,4)
(1) AKSW Group, University of Leipzig, Germany
(2) Database Group, University of Leipzig, Germany
(3) Leipzig University of Applied Sciences, Germany
(4) Data Science Group, Paderborn University, Germany
June 6, 2018
Valdestilhas et al. (AKSW) Where is my URI? June 6, 2018 1 / 21
5. Motivation
Use cases
Why do we need to know which dataset a URI belongs to?
Data quality in Link Repositories
Regenerate mappings using the Concise Bounded Descriptions (CBDs), so that
link discovery algorithms can be reapplied to validate the mappings.
Part of LinkLion 2.0
7. Motivation
Use cases
Federated Query Processing
Query planning and Source (dataset) selection
WIMU will find relevant sources against the individual triple patterns of a
given SPARQL query
SELECT ?v1 ?v2 WHERE {
  ?uri <p1> ?v1 .  # Triple Pattern 1 (TP1)
  ?uri <p2> ?v2 .  # Triple Pattern 2 (TP2)
}
[Figure: source selection over four RDF sources, S1 to S4]
WIMU source selection
TP1 = {S1,15}, {S2,9}, {S3,7}
TP2 = {S1,20}, {S2,11}, {S4,10}
Total selected sources = 6
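The total above can be reproduced with a short sketch. It assumes the number next to each source is the score WIMU reports for it (the slide does not say so explicitly), and the function names are illustrative only.

```python
# Candidate sources per triple pattern, copied from the slide.
# Assumption: the number next to each source is its WIMU score.
selection = {
    "TP1": {"S1": 15, "S2": 9, "S3": 7},
    "TP2": {"S1": 20, "S2": 11, "S4": 10},
}


def total_selected(selection):
    """Count (triple pattern, source) pairs, as in 'Total selected sources'."""
    return sum(len(sources) for sources in selection.values())


def distinct_sources(selection):
    """Sources that are relevant to at least one triple pattern."""
    return sorted({name for sources in selection.values() for name in sources})


def ranked(selection, pattern):
    """Sources for one triple pattern, highest score first."""
    return sorted(selection[pattern].items(), key=lambda kv: kv[1], reverse=True)


print(total_selected(selection))
print(distinct_sources(selection))
print(ranked(selection, "TP2"))
```

The total counts a source once per triple pattern, which is why four distinct sources yield six selections.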
8. Approach
Goal
Index URIs and their use to enable Linked Data consumers to find relevant RDF
data sources
Rank the datasets proportionally to the number of literals
Keep the provenance of the URI
10. Approach
Steps to create the index
1. Retrieve list of datasets from sources.
2. Retrieve data from datasets.
3. Build three indexes from the datasets, building the rank.
4. Merge the indexes into one.
5. Make the index available and browsable via a web application and an API service.
Sources shown in the diagram: Dumps & Endpoints, LOD Laundromat, HDT files
Literal count per subject URI:
SELECT ?s (COUNT(?o) AS ?c)
WHERE {
  ?s ?p ?o . FILTER(isLiteral(?o))
} GROUP BY ?s
WIMU Service index (URI, dataset, score):
URI         Dataset    Score
http://...  dataset1   3
http://...  dataset2   1
http://...  datasetN   n
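The index-building steps can be sketched in plain Python. This is an illustration under assumptions, not the WIMU implementation: triples are modeled as tuples with a flag marking literal objects, all names are hypothetical, and keeping the larger score when the same URI and dataset pair appears in two indexes is a guess, since the slides do not describe the merge rule.

```python
from collections import Counter, defaultdict


def literal_counts(triples):
    """Count literal objects per subject, mirroring the SPARQL query.

    triples: iterable of (subject, predicate, object, object_is_literal).
    """
    counts = Counter()
    for subject, _predicate, _object, is_literal in triples:
        if is_literal:
            counts[subject] += 1
    return counts


def build_index(datasets):
    """Map each URI to {dataset name: score} for one group of datasets."""
    index = defaultdict(dict)
    for name, triples in datasets.items():
        for uri, score in literal_counts(triples).items():
            index[uri][name] = score
    return index


def merge_indexes(*indexes):
    """Merge several indexes into one."""
    merged = defaultdict(dict)
    for index in indexes:
        for uri, per_dataset in index.items():
            for name, score in per_dataset.items():
                # Assumption: keep the larger score if a pair repeats.
                merged[uri][name] = max(score, merged[uri].get(name, 0))
    return merged


def top(index, uri, n):
    """Return the n best-scoring datasets for a URI."""
    scores = index.get(uri, {})
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical data: three literals in dataset1, one in dataset2.
uri = "http://example.org/a"
dumps = build_index({"dataset1": [
    (uri, "p", "x", True), (uri, "p", "y", True),
    (uri, "p", "z", True), (uri, "p", "http://example.org/b", False),
]})
hdt = build_index({"dataset2": [(uri, "p", "x", True)]})
merged = merge_indexes(dumps, hdt)
print(top(merged, uri, 5))
```

Triples whose object is not a literal do not contribute to the score, which matches the FILTER in the query.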
11. Approach
The interface (Web and JSON)
curl "http://wimu.aksw.org/Find?top=5&uri=http://dbpedia.org/resource/Leipzig"
[{"dataset":"http://downloads.dbpedia.org/2016-10/core-i18n/en/infobox_properties_en.ttl.bz2","CountLiteral":"236"},
{"dataset":"http://gaia.infor.uva.es/hdt/dbpedia2015.hdt.gz","CountLiteral":"165"},
{"dataset":"http://download.lodlaundromat.org/a9dabf348fd6262edfbbcf7256b0f839?type=hdt","CountLiteral":"142"},
{"dataset":"http://download.lodlaundromat.org/dde1dcc095b38a1b65ebfbc7696d7998?type=hdt","CountLiteral":"124"}]
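A client might read this response as follows. The sketch uses the first two entries of the output above as sample data and assumes only the two fields shown, "dataset" and "CountLiteral"; the function name is hypothetical. Note that the counts arrive as JSON strings, so they are converted to integers before sorting.

```python
import json

# First two entries of the response shown above.
SAMPLE = """[
{"dataset":"http://downloads.dbpedia.org/2016-10/core-i18n/en/infobox_properties_en.ttl.bz2","CountLiteral":"236"},
{"dataset":"http://gaia.infor.uva.es/hdt/dbpedia2015.hdt.gz","CountLiteral":"165"}
]"""


def parse_find_response(text):
    """Return (dataset URL, literal count) pairs, highest count first."""
    rows = json.loads(text)
    pairs = [(row["dataset"], int(row["CountLiteral"])) for row in rows]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for dataset, count in parse_find_response(SAMPLE):
    print(count, dataset)
```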
12. Approach
Usage
Service: https://wimu.aksw.org/Find
Parameter  Default  Description
top        0        Number of top-ranked datasets to return
uri        –        URI to search for
link       –        URL of a linkset
cbd        –        URI that is the origin of the CBD
ds         –        URL from which to download the dataset
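A request URL for the service can be assembled from these parameters. The helper below is a hypothetical sketch using only the Python standard library; it percent-encodes every value, which is an assumption, since the slides do not state whether the service requires encoded URIs.

```python
from urllib.parse import urlencode

SERVICE = "https://wimu.aksw.org/Find"
ALLOWED = {"top", "uri", "link", "cbd", "ds"}  # parameters from the table


def find_url(**params):
    """Build a request URL, rejecting parameters not listed in the table."""
    unknown = set(params) - ALLOWED
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # urlencode percent-encodes the values (':' -> %3A, '/' -> %2F).
    return SERVICE + "?" + urlencode(params)


print(find_url(top=5, uri="http://dbpedia.org/resource/Leipzig"))
```

Rejecting unknown parameter names early catches typos such as `url` for `uri` before a request is sent.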
13. Approach
Examples
Single URI, top 5 datasets
https://wimu.aksw.org/Find?top=5&uri=http://dbpedia.org/resource/Leipzig
Linkset
https://wimu.aksw.org/Find?link=http://www.linklion.org/download/mapping/sws.geonames.org---purl.org.nt
CBD
http://wimu.aksw.org/Find?cbd=http%3A%2F%2Fciteseer.rkbexplorer.com%2Fid%2Fresource-CS116606&ds=http://download.lodlaundromat.org/b7081efa178bc4ab3ff3a6ef5abac9b2?type=hdt
14. Approach
Relevant points
Ranks the datasets from LODStats and LODLaundromat using a score function
Can process linksets (more than one URI per request)
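To illustrate what "more than one URI per request" involves, the sketch below collects the distinct URIs linked in an N-Triples linkset. How WIMU processes linksets internally is not described in the slides; this is a client-side illustration with hypothetical names and sample data, and it assumes every link is a triple whose subject and object are both IRIs.

```python
import re

# Matches an IRI written in angle brackets, as in N-Triples.
IRI = re.compile(r"<([^<>\s]*)>")


def linkset_uris(ntriples_text):
    """Return the distinct subject and object IRIs, in order of appearance."""
    uris, seen = [], set()
    for line in ntriples_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        iris = IRI.findall(line)
        if len(iris) < 3:
            continue  # not an IRI-to-IRI link
        for uri in (iris[0], iris[2]):  # subject and object
            if uri not in seen:
                seen.add(uri)
                uris.append(uri)
    return uris


# Hypothetical linkset with a single owl:sameAs link.
SAMPLE = """# example linkset
<http://example.org/a> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/b> .
"""
print(linkset_uris(SAMPLE))
```

Each collected URI could then be looked up individually through the `uri` parameter.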
15. Experiments and results
Experimental Setup
Hardware: 64 CPU cores, 126 GB RAM, 2 TB hard disk.
Creation of the index: 3 days and 7 hours.
16. Experiments and results
The heuristic behind the Literals
A sample of 100 DBpedia URIs and occurrences of the top-1 datasets
according to the number of Literals
DBpedia:LODLaundromat::0d75803ad1faf47355a2656821d598bb?type=hdt
DBpedia:LODLaundromat::1fdf94a6a8ec729410fe99da03f721d9?type=hdt
DBpedia:LODLaundromat::4579a1f4a4637037f1995717b96f7bb5?type=hdt
DBpedia:LODLaundromat::7797f921760e5d4a02ee6eca37c951ec?type=hdt
DBpedia:LODLaundromat::e8f74020ae3cf760c34af2b20c4451c4?type=hdt
Dbpedia:2016:infobox_properties_en.ttl.bz2
DBpedia:RDFHDT:dbpedia2015.hdt.gz
[Bar chart over the datasets listed above; axis ticks run from 0 to 70]
Manually checked: 100% precision
17. Experiments and results
The heuristic behind the Literals
Top 100 datasets from http://dbpedia.org/resource/Leipzig
[Figure: number of literals per dataset; axis ticks run from 0 to 250]
18. Experiments and results
The heuristic behind the Literals
Rank top 5 datasets
[Figure: two bar charts of the number of literals per dataset]
Panel 1: http://dbpedia.org/resource/Sandersleben, datasets D1 to D5 (axis 0 to 80)
Panel 2: http://dbpedia.org/resource/Leipzig, datasets D6 to D10 (axis 0 to 240)
D1 http://km.aifb.kit.edu/projects/btc-2009/btc-2009-chunk-101.gz
D2 http://km.aifb.kit.edu/projects/btc-2012/dbpedia/data-0.nq.gz
D3 http://gaia.infor.uva.es/hdt/dbpedia2015.hdt.gz
D4 http://data.dws.informatik.uni-mannheim.de/dbpedia/3.8/ru/infobox_properties_en_uris_ru.nt.bz2
D5 http://data.dws.informatik.uni-mannheim.de/dbpedia/3.9/ru/raw_infobox_properties_en_uris_ru.nt.bz2
D6 http://downloads.dbpedia.org/2016-10/core-i18n/en/infobox_properties_en.ttl.bz2
D7 http://gaia.infor.uva.es/hdt/dbpedia2015.hdt.gz
D8 http://km.aifb.kit.edu/projects/btc-2009/btc-2009-chunk-061.gz
D9 http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/infobox_properties_unredirected_en.nq.bz2
D10 http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/infobox_properties_en.nq.bz2
19. Experiments and results
Datasets
                   LOD Laundromat   LODStats         Total
URIs indexed       4,185,133,445    31,121,342       4,216,254,787
Datasets checked   658,206          9,960            668,166
Triples processed  19,891,702,202   38,606,408,854   58,498,111,056
LODStats
60% offline, 14% empty
8% triples with literals as objects are blank nodes
35% of online datasets produce errors when parsed with Jena
69.8% of datasets with parser errors
LODLaundromat
2.3% parsing errors. 99% indexed by WIMU
20. Experiments and results
Datasets
Jena parsing errors
Content negotiation error from Content-type: text/html.
Error making the query, see cause for details
null
Service URL overrides the 'query' SPARQL protocol parameter
XMLStreamException: ParseError at [row,col]: [-8251091,8]
Failed when initializing the StAX parsing engine
Not all datasets are ready to use
21. Conclusions and Future work
A regularly updated database index of more than 660K datasets from
LODStats and LODLaundromat
An efficient service on the web that can determine which dataset will most
likely define a URI
Various statistics of datasets indexed from LODStats and LODLaundromat
Future work: Integrate the second version of LinkLion
http://www.linklion.org
22. That’s all Folks!
Thanks!
Questions?
Website: http://wimu.aksw.org/
Contact: valdestilhas@informatik.uni-leipzig.de
This work was supported by grants from the EU H2020 Framework Programme
provided for the project HOBBIT (GA no. 688227).