From Data to Knowledge - Profiling & Interlinking Web Datasets

From data to knowledge –
profiling and interlinking Web datasets
Stefan Dietze
L3S Research Center
31/07/14

Recent work on Linked Data exploration/discovery/search
 Entity interlinking & dataset interlinking recommendation
 Dataset profiling
 Data consistency & conflicts
Research areas
 Web science, Information Retrieval, Semantic Web & Linked
Data, data & knowledge integration (mapping, classification,
interlinking)
 Application domains: education/TEL, Web archiving, …
Some projects
Introduction
http://www.l3s.de/
 See also: http://purl.org/dietze

…why are there so few datasets actually used?
 Data reuse and in-links focus on trusted „reference
graphs“ such as DBpedia, Freebase etc.
 Long tail of LD datasets which are neither reused nor linked
to (LOD Cloud alone 300+ datasets, 50 bn triples)
 Explanations?
Linked Data is awesome, but...
 „HTTP-accessibility“
(SPARQL, URI-dereferencing)
 „Structure“ & „Semantics“
(=> shared/linked vocabularies)
 „Interlinked“
 „Persistent“
Hm,
really?

Linked data is more diverse than we think
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of datasets?
 Less than 50% of all SPARQL endpoints
are actually responsive at any given point in time
 “THE” SPARQL protocol?
No, but many variants & subsets
 …
“Semantics”, links, quality?
 …data consistency? [? Yuan2014 ?]
 …data accuracy (eg DBpedia)?
[Paulheim2013]
 …vocabulary reuse? [D’AquinWebSci13]
 …schema compliance (RDFS, schemas)
[HoganJWS2012]
SPARQL Web-Querying Infrastructure: Ready for Action?,
Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves
Vandenbussche, International Semantic Web Conference 2013
(ISWC2013).
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A.,
Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic Web – ISWC
2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525
An empirical survey of Linked Data conformance. Hogan, A., Umbrich, J., Harth,
A., Cyganiak, R., Polleres, A., Decker., S., Journal of Web Semantics 14, 2012

Too many/diverse datasets, too little knowledge
 Which datasets are useful & trustworthy for case
XY (eg „learning about the solar system“) ? Which
topics are covered?
 Types: which datasets describe statistics, videos,
slides, publications etc?
 Currentness, dynamics, accessibility/reliability,
data quantity & quality?

db:Astro. Objects
Dataset
Metadata
BIBO
AIISO
FOAF
contains
Entity disambiguation &
linking [ESWC13]
Topic profile extraction
[WWW13, ESWC14]
db:Astronomy
db:Astro. Objects
Dataset
Catalog/Registry
yov:Video
po:Programme
BBC Programme
<po:Programme …>
<po:Series>Wonders of the Solar System</.>
<po:Actor>Brian Cox</…>
</po:Programme…>
<yo:Video …>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video…>
Yovisto Video
bibo:Film
Schema mappings
[WebSci13]
Data curation, linking and dataset profiling

Schemas/vocabularies on the Web: XKCD 927
https://xkcd.com/927/
schemas & vocabularies

typeX
typeX
Schema assessment and mapping
Co-occurrence of
data types
(in 146 datasets:
144 Vocabularies,
588 highly
overlapping types,
719 Properties)
Co-occurrence after
mapping into most
frequent schemas
(201 frequent types
mapped into 79
classes)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
bibo:Film
bibo:Document
po:Programme sioc:Item
foaf:Document
yov:Video
typeX

LinkedUp Data Catalog
in a nutshell http://datahub.io/group/linked-education
http://data.linkededucation.org/linkedup/catalog/
 RDF (VoID) dataset catalog: browse &
query distributed datasets
 Live information about endpoint
accessibility
 Federated queries using type mappings
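One way to picture federated queries over type mappings is to expand a single target type into a query that also matches the mapped, dataset-specific types. The sketch below does this with a SPARQL UNION pattern. The mapping table and the function name `expand_type_query` are assumptions for illustration (the type names are taken loosely from the slides), not the catalog's actual implementation, and PREFIX declarations are left to the caller.

```python
# Hypothetical type mappings: target class -> dataset-specific classes.
# These pairs are assumed for illustration only.
TYPE_MAPPINGS = {
    "bibo:Film": ["yov:Video", "po:Programme"],
    "foaf:Document": ["bibo:Document", "sioc:Item"],
}


def expand_type_query(target_type, mappings=TYPE_MAPPINGS, limit=10):
    """Build a SELECT query matching the target type or any mapped type.

    Prefix declarations are not emitted; prepend them before sending
    the query to an endpoint.
    """
    types = [target_type] + list(mappings.get(target_type, []))
    # One graph pattern per type, joined with UNION.
    branches = ["{ ?resource a " + t + " }" for t in types]
    lines = ["SELECT DISTINCT ?resource WHERE {"]
    lines.append("  " + "\n  UNION\n  ".join(branches))
    lines.append("}")
    lines.append("LIMIT " + str(limit))
    return "\n".join(lines)


if __name__ == "__main__":
    print(expand_type_query("bibo:Film"))
```

The same expanded query could then be sent to each responsive endpoint listed in the catalog; how results are merged is not covered here.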

Dataset interlinking recommendation
Candidate datasets for interlinking?
t
Linkset1
Linkset2
Approach
 Given dataset t, rank the datasets d_i from D
according to a probability score(d_i, t) that they
contain linking candidates (entities)
 Features:
 Vocabulary overlap
 Existing links (SNA)
 Linking candidates likely if datasets share
common (a) schema elements, or (b) links
(friend of a friend)
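The vocabulary-overlap feature can be illustrated with a very small ranking sketch. Jaccard overlap is used here as a simple stand-in for score(d_i, t); the cited papers define their own probabilistic and SNA-based scores, which are not reproduced. The dataset names and vocabulary terms are toy assumptions.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of vocabulary terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(target_vocab, candidates):
    """Rank candidate datasets by vocabulary overlap with the target.

    candidates maps a dataset name to the set of schema elements
    (classes, properties) it uses. Ties are broken by name.
    """
    scored = [(name, jaccard(target_vocab, vocab))
              for name, vocab in candidates.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Toy data, assumed for illustration only.
    target = {"foaf:Person", "bibo:Document", "dcterms:title"}
    candidates = {
        "dataset_a": {"foaf:Person", "bibo:Document", "dcterms:creator"},
        "dataset_b": {"po:Programme", "dcterms:title"},
        "dataset_c": {"geo:lat", "geo:long"},
    }
    for name, score in rank_candidates(target, candidates):
        print(name, round(score, 2))
```

A link-based feature could be added in the same way, for example by scoring a candidate higher when datasets already linked from t also link to it.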
Conclusions
 Roughly 60% MAP for both approaches
 Future work: quantity of links, extraction
of experimental data from datasets…
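For readers unfamiliar with the measure, MAP (mean average precision) averages, over all target datasets, the precision observed at each rank where a relevant candidate appears. The sketch below shows the standard computation; the function names and the toy rankings are assumptions, not the evaluation code of the cited papers.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Toy rankings, assumed for illustration only.
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d5", "d6"], {"d6"}),
    ]
    print(round(mean_average_precision(runs), 4))
```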
Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A.,
Dietze, S., Recommending Tripleset Interlinking through a
Social Network Approach, The 14th International Conference
on Web Information System Engineering (WISE 2013),
Nanjing, China, 2013.
Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova,
M.A., Dietze, S., Identifying candidate datasets for data
interlinking, in Proceedings of the 13th International
Conference on Web Engineering, (2013).
Rank
1 DBLP
2 ACM
3 OAI
4 CiteSeer
5 IBM
6 Roma
7 IEEE
8 Ulm
9 Pisa
?
?

<yo:Video 8748720>
…
</yo:Video 8748720>
Video
Topics/categories addressed?
Relatedness of resources/entities?
(types, semantics)
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Combining a co-occurrence-based and a semantic measure
for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R.
Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended
Semantic Web Conference, (May 2013).
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B., Dietze, S.,
Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended
Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
Dataset & entity linking: semantics of resources/datasets?
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
Pluto?

db:Pluto
(Dwarf
Planet)
db:Astronomical Objects
db:Sun
Disambiguation/linking using background knowledge
„Semantic relatedness“ of resources?
db:Astronomy

db:Pluto
(Dwarf
Planet)
db:Astronomical Objects
db:Astronomy
 Computation of connectivity scores
between resources/entities
 Method: combination of a
 (i) semantic (graph-based) connectivity
score (SCS) with
 (ii) a Web co-occurrence-based measure
(CBM) (similar to Normalised Google Distance, NGD)
 For (i): adaptation of Katz-Index from SNA
for (linked) data graphs (considering path
number and path lengths of transversal
properties)
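The two ingredients named above can be sketched in a few lines. The first function is a truncated Katz index, which sums the number of walks of each length between two nodes, damped by beta per step, so that many short connections score higher than a few long ones. The second is the standard Normalised Google Distance, given only as a reference point, since the slides describe CBM as similar to NGD without stating its exact form. The values of `beta` and `max_length`, the toy graph, and the function names are assumptions; the published SCS adapts the index to linked data graphs in ways not reproduced here.

```python
import math


def katz_score(adjacency, source, target, beta=0.5, max_length=4):
    """Truncated Katz index: sum over l of beta**l * (walks of length l)."""
    # walks[node] = number of walks of the current length from source to node
    walks = {source: 1}
    score = 0.0
    for length in range(1, max_length + 1):
        next_walks = {}
        for node, count in walks.items():
            for neighbour in adjacency.get(node, ()):
                next_walks[neighbour] = next_walks.get(neighbour, 0) + count
        walks = next_walks
        score += (beta ** length) * walks.get(target, 0)
    return score


def normalised_google_distance(fx, fy, fxy, n):
    """NGD from hit counts fx, fy, joint count fxy and index size n."""
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf  # terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


if __name__ == "__main__":
    # Toy undirected graph, assumed for illustration only.
    graph = {
        "Pluto": ["AstronomicalObjects"],
        "Sun": ["AstronomicalObjects"],
        "AstronomicalObjects": ["Pluto", "Sun", "Astronomy"],
        "Astronomy": ["AstronomicalObjects"],
    }
    print(katz_score(graph, "Pluto", "Sun"))
    print(normalised_google_distance(100, 100, 10, 10000))
```

Note that a low NGD means strong co-occurrence, so a distance of this kind would need to be inverted or rescaled before it can be combined with a connectivity score.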
db:Sun
SCS = 0.32
CBM = 0.24
http://purl.org/vol/doc/
http://purl.org/vol/ns/
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
Entity linking: semantic relatedness

Entity linking: evaluation
 Evaluation based on USA Today News items (80,000 entity pairs)
 Manually created gold standard
(1000 entity pairs)
 Baseline: Explicit Semantic Analysis (ESA)
=> CBM/SCS: „relatedness“; ESA: „similarity“
Precision/Recall/F1 for SCS, CBM, ESA.
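As a reminder of how these measures are obtained against a gold standard of entity pairs, the sketch below computes precision, recall and F1 for a set of predicted pairs. The function name and the toy pairs are assumptions; no figures from the evaluation are reproduced.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 of predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy entity pairs, assumed for illustration only.
    predicted = {("a", "b"), ("a", "c"), ("b", "d")}
    gold = {("a", "b"), ("b", "d"), ("c", "d"), ("d", "e")}
    print(precision_recall_f1(predicted, gold))
```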
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).

 Extracting representative metadata („topic profile“) for datasets
 Ranking of most representative (DBpedia) categories (= topics); applied
to all responsive LOD datasets
 Scalability vs representativeness: sampling & ranking for good
scalability/accuracy balance
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
Dataset profiling: what's the data about?

Dataset profiling: approach
1. Sampling of resource instances
(random sampling, weighted sampling, resource
centrality sampling)
2. Entity and topic extraction (NER via DBpedia
Spotlight, category mapping and expansion)
3. Normalisation and ranking (using graphical
models such as PageRank with Priors, HITS with
Priors and K-Step Markov)
=> Result: weighted dataset-topic profile graph
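The ranking step can be illustrated with a generic personalised PageRank, in which the random walk jumps back to a set of prior nodes instead of to a uniformly chosen node. This is a sketch of the general idea behind PageRank with Priors only; the back-jump probability, the iteration count, the toy dataset-entity-category graph and the function name are assumptions, and the exact formulation used in the cited paper is not reproduced.

```python
def pagerank_with_priors(edges, priors, back_prob=0.15, iterations=50):
    """Personalised PageRank over directed edges.

    priors maps nodes to non-negative weights; the walk returns to the
    prior distribution with probability back_prob at every step.
    """
    total = sum(priors.values())
    if total <= 0:
        raise ValueError("priors must contain positive weight")
    nodes = set(priors)
    for source, target in edges:
        nodes.update((source, target))
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    prior = {node: priors.get(node, 0.0) / total for node in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        new_rank = {node: back_prob * prior[node] for node in nodes}
        for node in nodes:
            mass = (1.0 - back_prob) * rank[node]
            if out_links[node]:
                share = mass / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
            else:
                # Dangling node: hand its mass back to the priors.
                for other in nodes:
                    new_rank[other] += mass * prior[other]
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy graph: dataset -> sampled entities -> categories (assumed).
    edges = [
        ("dataset", "e1"), ("dataset", "e2"),
        ("e1", "cat:Astronomy"),
        ("e2", "cat:Astronomy"), ("e2", "cat:Physics"),
    ]
    scores = pagerank_with_priors(edges, {"dataset": 1.0})
    for node in sorted(scores, key=scores.get, reverse=True):
        print(node, round(scores[node], 3))
```

Restricting the output to category nodes and normalising their scores would give one possible weighted topic profile for the dataset.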
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).

Dataset profiling: exploring LOD datasets/topics
in a nutshell http://data-observatory.org/lod-profiles/
 Automatic extraction of dataset “topics” [ESWC2014]
=> RDF/VoID dataset profiles
 Visualisation & exploration of dataset-topic graph
(datasets, topics, relationships)
 Includes all (responsive) datasets of LOD Cloud

Dataset profiling: results evaluation
NDCG (averaged over all datasets).
Datasets & Ground Truth
 Yovisto, Oxpoints, LAK Dataset, Semantic Web
Dogfood
 Crowd-sourced topic indicators from datasets
(keywords, tags)
 Manual mapping to entities & category extraction
(ranking according to frequency)
Baselines
 1) LDA, 2) tf/idf (applied to entire datasets)
 Topic extraction according to our approach,
weighting/ranking based on term weight
Measure
 NDCG @ rank l
 Performance (time/NDCG) for different sampling
strategies/sizes etc
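NDCG at rank l compares the discounted gain of the produced topic ranking with that of an ideal ordering of the ground truth. The sketch below uses the linear-gain form of DCG with a log2 discount; other variants exist (for example exponential gain), and the slides do not state which one was used. The function names and the toy relevance grades are assumptions.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(gain / math.log2(position + 1)
               for position, gain in enumerate(gains, start=1))


def ndcg_at(ranked, relevance, l):
    """NDCG at rank l for a ranked list of topics.

    relevance maps a topic to its ground-truth grade; topics missing
    from the mapping count as grade 0.
    """
    gains = [relevance.get(item, 0.0) for item in ranked[:l]]
    ideal = sorted(relevance.values(), reverse=True)[:l]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Toy relevance grades, assumed for illustration only.
    relevance = {"Astronomy": 3, "Physics": 2, "Education": 1}
    print(ndcg_at(["Astronomy", "Physics", "Education"], relevance, 3))
    print(ndcg_at(["Education", "Physics", "Astronomy"], relevance, 3))
```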

dbp:Category:Royal_Medal_winners
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
dbp:Category:Web_Services
dbp:Category:HTTP
dbp:Category:Unitarian_Universalists
dbp:Category:World_Wide_Web
What have these categories in common?

Diversity of category profile for a single paper
Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web".
Scientific American Magazine.
person
document
dbp:Tim_Berners-Lee
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
dbp:Semantic_Web
dbp:Category:Semantic_Web
dbp:Category:Web_Services
dbp:Category:HTTP
dbp:Category:Unitarian_Universalists
first-level categories (dcterms:subject)
dbp:Category:World_Wide_Web
dbp:Category:Royal_Medal_winners

http://data-observatory.org/led-explorer/
 Type specific views on datasets/
categories
 “Document” (foaf:Document)
 “Person” (foaf:Person)
 “Course” (aiiso:Course)
 Currently applied to datasets in
LinkedUp Catalog only (as
schema mappings already
available here)
Type-specific exploration of dataset categories

May – September 2013; October 2013 – May 2014; May 2014 – October 2014
Series of Open Data Competitions to promote applications which exploit Linked Open Data
http://www.linkedup-challenge.org/
http://www.linkedup-project.eu/
LinkedUp Challenge: Linking Web Data (for Education)
 “Vici” open just now
 Final events at ISWC2014
 Submission: 5 September

Conclusions & future work
Summary
 Increasing amounts of data => require knowledge about
nature and relationships of datasets
 Profiling: scalable methods for extracting dataset metadata
 Interlinking: connectivity of entities or datasets
Future work – LD evolution, preservation, consistency
 In RDF graphs (eg LOD Cloud), „all“ nodes are connected
 LD preservation: which datasets to preserve (entity
„neighbourhood“)? => semantic relatedness
 Link correctness in evolving LD: investigating impact of
changes on link correctness
 Application: informed preservation and enrichment strategies

Thank you!
WWW
See also (general)
 http://linkedup-project.eu
 http://linkededucation.org
 http://data.l3s.de
http://purl.org/dietze
See also (data)
 http://data.linkededucation.org
 http://data.linkededucation.org/linkedup/catalog/
 http://lak.linkededucation.org
 Besnik Fetahu (L3S)
 Elena Demidova (L3S)
 Bernardo Pereira Nunes (PUC Rio)
 Marco Casanova (PUC Rio)
 Luiz Andre Paes Leme (PUC Rio)
 Giseli Lopes (PUC Rio)
 Davide Taibi (CNR, IT)
 Mathieu d’Aquin (Open University, UK)
 and many more…
Acknowledgements
