Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
From data to knowledge –
profiling and interlinking Web datasets
Stefan Dietze
L3S Research Center
31/07/14 1Stefan Dietze
Recent work on Linked Data exploration/discovery/search
 Entity interlinking & dataset interlinking recommendation
 Data...
…why are there so few datasets actually used?
 Date reuse and in-links focused on trusted „reference
graphs“ such as DBpe...
Linked data is more diverse than we think
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of...
Too many/diverse datasets, too little knowledge
Stefan Dietze 31/07/14
?
?
? ?? ?
 Which datasets are useful & trustworth...
db:Astro. Objects
Dataset
Metadata
Stefan Dietze 31/07/14
BIBO
AAISO
FOAF
contains
Entity disambiguation &
linking [ESWC13...
Schemas/vocabularies on the Web: XKCD 927
Stefan Dietze 31/07/14
https://xkcd.com/927/
schemas & vocabularies
typeX
typeX
Schema assessment and mapping
Co-occurence of
data types
(in 146 datasets:
144 Vocabularies,
588 highly
overla...
LinkedUp Data Catalog
in a nutshell http://datahub.io/group/linked-education
http://data.linkededucation.org/linkedup/cata...
31/07/14
Dataset interlinking recommendation
Candidate datasets for interlinking?
13
t
Linkset1
Linkset2
Approach
 Given ...
<yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
Topics/categories addressed?...
db:Pluto
(Dwarf
Planet)
db:Astrono-
mical Objects
db:Sun
Disambiguation/linking using background knowledge
„Semantic relat...
db:Pluto
(Dwarf
Planet)
db:Astrono-
mical Objects
db:Astronomy
 Computation of connectivity scores
between resources/enti...
Entity linking: evaluation
31/07/14 19Stefan Dietze
 Evaluation based on USA Today News items (80.000 entity pairs)
 Man...
db:Astro. Objects
Dataset
Metadata
Stefan Dietze 31/07/14
Entity disambiguation &
linking [ESWC13]
Topic profile extractio...
Dataset profiling: approach
Stefan Dietze 31/07/14
1. Sampling of resource instances
(random sampling, weighted sampling, ...
Dataset profiling: exploring LOD datasets/topics
in a nutshell http://data-observatory.org/lod-profiles/
Stefan Dietze 31/...
Dataset profiling: results evaluation
Stefan Dietze 31/07/14
NDCG (averaged over all datasets) .
Datasets & Ground Truth
...
31/07/14
dbp:Category:Royal_Medal_winners
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
...
31/07/14
Diversity of category profile for a single paper
Berners-Lee, Tim; Hendler, James, Ora Lassila (2001). "The Seman...
31/07/14
http://data-observatory.org/led-explorer/
 Type specific views on datasets/
categories
 “Document” (foaf:docume...
May –September 2013 October 2013 – May 2014 May 2014 – October 2014
Series of Open Data Competitions to promote applicatio...
Conclusions & future work
Summary
 Increasing amounts of data => require knowledge about
nature and relationships of data...
Thank you!
WWW
See also (general)
 http://linkedup-project.eu
 http://linkededucation.org
 http://data.l3s.de
http://p...
Upcoming SlideShare
Loading in …5
×

From Data to Knowledge - Profiling & Interlinking Web Datasets

776 views

Published on

Talk at KEYSTONE/ESWC meeting 2014

Published in: Technology
  • Be the first to comment

From Data to Knowledge - Profiling & Interlinking Web Datasets

  1. 1. From data to knowledge – profiling and interlinking Web datasets Stefan Dietze L3S Research Center 31/07/14 1Stefan Dietze
  2. 2. Recent work on Linked Data exploration/discovery/search  Entity interlinking & dataset interlinking recommendation  Dataset profiling  Data consistency & conflicts Research areas  Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking)  Application domains: education/TEL, Web archiving, … Some projects Introduction http://www.l3s.de/ 31/07/14 2  See also: http://purl.org/dietze Stefan Dietze
  3. 3. …why are there so few datasets actually used?  Date reuse and in-links focused on trusted „reference graphs“ such as DBpedia, Freebase etc  Long tail of LD datasets which are neither reused nor linked to (LOD Cloud alone 300+ datasets, 50 bn triples)  Explanations? Linked Data is awesome, but... 31/07/14  „HTTP-accessibility“ (SPARQL, URI-dereferencing)  „Structure“ & „Semantics“ (=> shared/linked vocabularies)  „Interlinked“  „Persistent“ Hm, really? Stefan Dietze
  4. 4. Linked data is more diverse than we think SPARQL endpoint availability over time [Buil-Aranda et al 2013] Accessibility of datasets?  Less than 50% of all SPARQL endpoints actually responsive at given point of time  “THE” SPARQL protocol? No, but many variants & subsets  … “Semantics”, links, quality?  …data consistency? [? Yuan2014 ?]  …data accuracy (eg DBpedia)? [Paulheim2013]  …vocabulary reuse? [D’AquinWebSci13]  …schema compliance (RDFS, schemas) [HoganJWS2012] Stefan Dietze SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich Pierre-Yves Vandenbussch, International Semantic Web Conference 2013, (ISWC2013). Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013. Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525 An empirical survey of Linked Data conformance. Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker., S., Journal of Web Semantics 14, 2012
  5. 5. Too many/diverse datasets, too little knowledge Stefan Dietze 31/07/14 ? ? ? ?? ?  Which datasets are useful & trustworthy for case XY (eg „learning about the solar system“) ? Which topics are covered?  Types: which datasets describe statistics, videos, slides, publications etc?  Currentness, dynamics, accessability/reliability, data quantity & quality?
  6. 6. db:Astro. Objects Dataset Metadata Stefan Dietze 31/07/14 BIBO AAISO FOAF contains Entity disambiguation & linking [ESWC13] Topic profile extraction [WWW13, ESCW14] db:Astronomy db:Astro. Objects Dataset Catalog/Registry yov:Video po:Programme BBC Programme <po:Programme …> <po:Series>Wonders of the Solar System</.> <po:Actor>Brian Cox</…> </po:Programme…> <yo:Video …> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video…> Yovisto Video bibo:Fil bibo:Fi bibo:Film Schema mappings [WebSci13] Data curation, linking and dataset profiling
  7. 7. Schemas/vocabularies on the Web: XKCD 927 Stefan Dietze 31/07/14 https://xkcd.com/927/ schemas & vocabularies
  8. 8. typeX typeX Schema assessment and mapping Co-occurence of data types (in 146 datasets: 144 Vocabularies, 588 highly overlapping types, 719 Properties) Co-occurence after mapping into most frequent schemas (201 frequent types mapped into 79 classes) Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013. bibo:Film bibo:Document po:Programme sioc:Item 31/07/14 foaf:Document yov:Video typeX
  9. 9. LinkedUp Data Catalog in a nutshell http://datahub.io/group/linked-education http://data.linkededucation.org/linkedup/catalog/  RDF (VoID) dataset catalog: browse & query distributed datasets  Live information about endpoint accessibility  Federated queries using type mappings Stefan Dietze 31/07/14 http://datahub.io/group/linked-education
  10. 10. 31/07/14 Dataset interlinking recommendation Candidate datasets for interlinking? 13 t Linkset1 Linkset2 Approach  Given dataset t, ranking datasets from D according to probability score (di, t) to contain linking candidates (entities)  Features:  Vocabulary overlap  Existing links (SNA)  Linking candidates likely if datasets share common (a) schema elements, or (b) links (friend of a friend) Conclusions  Roughly 60% MAP for both approaches  Future work: quantity of links, extraction of experimental data from datasets… Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A., Dietze, S., Recommending Tripleset Interlinking through a Social Network Approach, The 14th International Conference on Web Information System Engineering (WISE 2013), Nanjing, China, 2013. Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova, M.A., Dietze, S., Identifying candidate datasets for data interlinking, in Proceedings of the 13th International Conference on Web Engineering, (2013). Rank 1 DBLP 2 ACM 3 OAI 4 CiteSeer 5 IBM 6 Roma 7 IEEE 8 Ulm 9 Pisa ? ? Stefan Dietze
  11. 11. <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720> Video Topics/categories addressed? Relatedness of resources/entities? (types, semantics) <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Programme Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended Semantic Web Conference, (May 2013). A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014). Dataset & entity linking: semantics of resources/datasets? 16Stefan Dietze 31/07/14 <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292> Slideset Pluto?
  12. 12. db:Pluto (Dwarf Planet) db:Astrono- mical Objects db:Sun Disambiguation/linking using background knowledge „Semantic relatetedness“ of resources? db:Astronomy 17 <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Programme <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292> Slideset Video Stefan Dietze 31/07/14 <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720>
  13. 13. db:Pluto (Dwarf Planet) db:Astrono- mical Objects db:Astronomy  Computation of connectivity scores between resources/entities  Method: combination of a  (i) semantic (graph-based) connectivity score (SCS) with  (ii) a Web co-occurence-based measure (CBM) (similar to NGD)  For (i): adaptation of Katz-Index from SNA for (linked) data graphs (considering path number and path lengths of transversal properties) db:Sun SCS = 0.32 CBM = 0.24 http://purl.org/vol/doc/ http://purl.org/vol/ns/ 19/09/2013 18Stefan Dietze Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended Semantic Web Conference, (May 2013). Entity linking: semantic relatedness <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292> Slideset <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Programme <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720> Video
  14. 14. Entity linking: evaluation 31/07/14 19Stefan Dietze  Evaluation based on USA Today News items (80.000 entity pairs)  Manually created gold standard (1000 entity pairs)  Baseline: Explicit Semantic Analysis (ESA) => CBM/SCS: „relatedness“; ESA: „similarity“ Precision/Recall/F1 for SCS, CBM, ESA. Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended Semantic Web Conference, (May 2013).
  15. 15. db:Astro. Objects Dataset Metadata Stefan Dietze 31/07/14 Entity disambiguation & linking [ESWC13] Topic profile extraction [WWW13, ESCW14] db:Astronomy db:Astro. Objects Dataset Catalog/Registry yov:Video <yo:Video …> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video…> Yovisto Video  Extracting representative metadata („topic profile“) for datasets  Ranking of most representative (DBpedia) categories (= topics); applied to all responsive LOD datasets  Scalability vs representativeness: sampling & ranking for good scalability/accuracy balance A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014). Dataset profiling: what‘s the data about?
  16. 16. Dataset profiling: approach Stefan Dietze 31/07/14 1. Sampling of resource instances (random sampling, weighted sampling, resource centrality sampling) 2. Entity and topic extraction (NER via DBpedia Spotlight, category mapping and expansion) 3. Normalisation and ranking (using graphical- models such as PageRank with Priors, HITS with Priors and K-Step Markov) => Result: weighted dataset-topic profile graph A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
  17. 17. Dataset profiling: exploring LOD datasets/topics in a nutshell http://data-observatory.org/lod-profiles/ Stefan Dietze 31/07/14  Automatic extraction of dataset “topics” [ESWC2014] => RDF/VoiD dataset profiles  Visualisation & exploration of dataset-topic graph (datasets, topics, relationships)  Includes all (responsive) datasets of LOD Cloud
  18. 18. Dataset profiling: results evaluation Stefan Dietze 31/07/14 NDCG (averaged over all datasets) . Datasets & Ground Truth  Yovisto, Oxpoints, LAK Dataset, Semantic Web Dogfood  Crowd-sourced topic indicators from datasets (keywords, tags)  Manual mapping to entities & category extraction (ranking according to frequency) Baselines  1) LDA, 2) tf/idf (applied to entire datasets)  Topic extraction according to our approach, weighting/ranking based on term weight Measure  NDCG @ rank l  Performance (time/NDCG) for different sampling strategies/sizes etc
  19. 19. 31/07/14 dbp:Category:Royal_Medal_winners dbp:Category:1955_births dbp:Category:People_from_London dbp:Category:Buzzwords dbp:Category:Web_Services dbp:Category:HTTP dbp:Category:Unitarian_Universalists dbp:Category:World_Wide_Web What have these categories in common? Stefan Dietze
  20. 20. 31/07/14 Diversity of category profile for a single paper Berners-Lee, Tim; Hendler, James, Ora Lassila (2001). "The Semantic Web". Scientific American Magazine. person document dbp:Tim_Berners-Lee dbp:Category:1955_births dbp:Category:People_from_London dbp:Category:Buzzwords dbp:Semantic_Web dbp:Category:Semantic_Web dbp:Category:Web_Services dbp:Category:HTTP dbp:Category:Unitarian_Universalists first-level categories (dcterms:subject) dbp:Category:World_Wide_Web dbp:Category:Royal_Medal_winners Stefan Dietze
  21. 21. 31/07/14 http://data-observatory.org/led-explorer/  Type specific views on datasets/ categories  “Document” (foaf:document)  “Person “ (foaf:person)  “Course” (aaiso:course)  Currently applied to datasets in LinkedUp Catalog only (as schema mappings already available here) Type-specific exploration of dataset categories Stefan Dietze
  22. 22. May –September 2013 October 2013 – May 2014 May 2014 – October 2014 Series of Open Data Competitions to promote applications which exploit Linked Open Data http://www.linkedup-challenge.org/ http://www.linkedup-project.eu/ LinkedUp Challenge: Linking Web Data (for Education)  “Vici” open just now  Final events at ISWC2014  Submission: 5 September
  23. 23. Conclusions & future work Summary  Increasing amounts of data => require knowledge about nature and relationships of datasets  Profiling: scalable methods for extracting dataset metadata  Interlinking: connectivity of entities or datasets Future work – LD evolution, preservation, consistency  In RDF graphs (eg LOD Cloud), „all“ nodes are connected  LD preservation: which datasets to preserve (entity „neighbourhood“)? => semantic relatedness  Link correctness in evolving LD: investigating impact of changes on link correctness  Application: informed preservation and enrichment strategies 31/07/14 37Stefan Dietze
  24. 24. Thank you! WWW See also (general)  http://linkedup-project.eu  http://linkededucation.org  http://data.l3s.de http://purl.org/dietze See also (data)  http://data.linkededucation.org  http://data.linkededucation.org/linkedup/catalog/  http://lak.linkededucation.org 31/07/14 38Stefan Dietze  Besnik Fetahu (L3S)  Elena Demidova (L3S)  Bernardo Pereira Nunes (PUC Rio)  Marco Casanova (PUC Rio)  Luiz Andre Paes Leme (PUC Rio)  Giseli Lopes (PUC Rio)  Davide Taibi (CNR, IT)  Mathieu d’Aquin (Open University, UK)  and many more… Acknowledgements

×