What's all the data about? - Linking and Profiling of Linked Datasets

1,928 views
1,986 views

Published on

Talk at LIRMM (http://www.lirmm.fr/) on 27 March 2014 about profiling of Linked Data.

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,928
On SlideShare
0
From Embeds
0
Number of Embeds
826
Actions
Shares
0
Downloads
11
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

What's all the data about? - Linking and Profiling of Linked Datasets

  1. 1. What‘s all the data about – profiling and interlinking Web datasets Stefan Dietze L3S Research Center 27/03/14 1Stefan Dietze
  2. 2. Recent work on Linked Data exploration/discovery/search  Entity interlinking & dataset interlinking recommendation  Dataset profiling  Data consistency & conflicts Research areas  Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking)  Application domains: education/TEL, Web archiving, … Some projects Introduction http://www.l3s.de/ Stefan Dietze 27/03/14 2  See also: http://purl.org/dietze
  3. 3. …why are there so few datasets actually used?  Date reuse and in-links focused on trusted „reference graphs“ such as DBpedia, Freebase etc  Long tail of LD datasets which are neither reused nor linked to (LOD Cloud alone 300+ datasets, 50 bn triples)  Explanations? Linked Data is awesome, but... 27/03/14  „HTTP-accessibility“ (SPARQL, URI-dereferencing)  „Structure“ & „Semantics“ (=> shared/linked vocabularies)  „Interlinked“  „Persistent“ Hm, really? Stefan Dietze
  4. 4. Linked data is more diverse than we think SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich Pierre-Yves Vandenbussch, International Semantic Web Conference 2013, (ISWC2013). SPARQL endpoint availability over time [Buil-Aranda et al 2013] Accessibility of datasets?  Less than 50% of all SPARQL endpoints actually responsive at given point of time  “THE” SPARQL protocol? No, but many variants & subsets  … Shared vocabularies & schemas, but:  …still very heterogeneous [d’Aquin, WebSci13]  …data partially messy and not conformant (RDFS, schemas) [HoganJWS2012]  …even widely used reference datasets such as DBpedia noisy [Paulheim2013] Co-occurence graph of data types in 146 datasets: 144 Vocabularies, 588 highly overlapping types, 719 Properties Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013. Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525 An empirical survey of Linked Data conformance. Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker., S., In the Journal of Web Semantics 14: pp. 14–44, 2012Stefan Dietze
  5. 5. What about data consistency? Inconsistency and Incompleteness of Linked Datasets – a Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., Web Science 2014, WebSci14, under review. 27/03/14
  6. 6. Too many/diverse datasets, too little information Stefan Dietze 27/03/14 ? ? ? ?? ?  Which datasets are useful & trustworthy for case XY (eg „learning about the solar system“) ? Which topics are covered?  Types: which datasets describe statistics, videos, slides, publications etc?  Currentness, dynamics, accessability/reliability, data quantity & quality?
  7. 7. Data curation and dataset profiling Dataset Catalog/Registry Stefan Dietze 27/03/14  Catalog of data: classification of datasets according to resource types, disciplines/topics, data quality, accessability, etc  Infrastructure for distributed/federated querying describes  Which datasets are useful & trustworthy for case XY (eg „learning about the solar system“) ? Which topics are covered?  Types: which datasets describe statistics, videos, slides, publications etc?  Currentness, dynamics, accessability/reliability, data quantity & quality?
  8. 8. db:Astro. Objects Dataset profiling: what’s all the data about Dataset Metadata Stefan Dietze 27/03/14 BIBO AAISO FOAF contains Entity disambiguation & linking [ESWC13] Topic profile extraction [WWW13, ESCW14] db:Astronomy db:Astro. Objects Dataset Catalog/Registry yov:Video po:Programme BBC Programme <po:Programme …> <po:Series>Wonders of the Solar System</.> <po:Actor>Brian Cox</…> </po:Programme…> <yo:Video …> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video…> Yovisto Video bibo:Fil bibo:Fi bibo:Film Schema mappings [WebSci13]
  9. 9. Schemas/vocabularies on the Web: XKCD 927 Stefan Dietze 27/03/14 https://xkcd.com/927/
  10. 10. Schema assessment and mapping Co-occurence of data types (in 146 datasets: 144 Vocabularies, 588 highly overlapping types, 719 Properties) Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013. <po:Programme …> <po:title>Secret Universe – The Life of the Cell</po:title> … </po:Programme…> BBC Programme <sioc:Item …> <label>Viral diseases & bacteria</title> … </sioc:Item ….> SlideShare Set po:Programme sioc:Item ? http://datahub.io/group/linked-education Stefan Dietze 27/03/14
  11. 11. Schema assessment and mapping Co-occurence of data types (in 146 datasets: 144 Vocabularies, 588 highly overlapping types, 719 Properties) Co-occurence after mapping into most frequent schemas (201 frequent types mapped into 79 classes) Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013. bibo:Slideshow bibo:Film bibo:Document <po:Programme …> <po:title>Secret Universe – The Life of the Cell</po:title> … </po:Programme…> BBC Programme <sioc:Item …> <label>Viral diseases & bacteria</title> … </sioc:Item ….> SlideShare Set po:Programme sioc:Item Stefan Dietze 27/03/14
  12. 12. LinkedUp Data Catalog in a nutshell http://datahub.io/group/linked-education http://data.linkededucation.org/linkedup/catalog/  RDF (VoID) dataset catalog: browse & query distributed datasets  Live information about endpoint accessibility  Federated queries using type mappings Stefan Dietze 27/03/14 http://datahub.io/group/linked-education
  13. 13. <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720> Video <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292> Slideset Topics/categories addressed? Relatedness of resources/entities? (types, semantics) <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Programme Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended Semantic Web Conference, (May 2013). A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014). Challenge: semantics of resources/datasets? 15Stefan Dietze 27/03/14
  14. 14. <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720> Video <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Programme Data disambiguation (for linking & profiling) Brian Cox? Sun? Pluto? 16Stefan Dietze 27/03/14
  15. 15. db:Pluto (Dwarf Planet) db:Astrono- mical Objects db:Sun Data disambiguation using background knowledge „Semantic relatetedness“ of resources? db:Astronomy 17 <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Programme <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292> Slideset <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720> Video Stefan Dietze 27/03/14
  16. 16. db:Pluto (Dwarf Planet) db:Astrono- mical Objects <yov:Lecture8748720> <title>Pluto & the Dwarf Planets</title> … < yov:Lecture8748720> Online Lecture db:Astronomy  Computation of connectivity scores between resources/entities  Method: combination of a  (i) semantic (graph-based) connectivity score (SCS) with  (ii) a Web co-occurence-based measure (CBM) (similar to NGD)  For (i): adaptation of Katz-Index from SNA for (linked) data graphs (considering path number and path lengths of transversal properties) db:Sun SCS = 0.32 CBM = 0.24 http://purl.org/vol/doc/ http://purl.org/vol/ns/ 19/09/2013 19Stefan Dietze Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended Semantic Web Conference, (May 2013). Entity linking: semantic relatedness <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292> Slideset <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Programme
  17. 17. Entity linking: evaluation 27/03/14 20Stefan Dietze  Evaluation based on USA Today News items (80.000 entity pairs)  Manually created gold standard (1000 entity pairs)  Baseline: Explicit Semantic Analysis (ESA) => CBM/SCS: „relatedness“; ESA: „similarity“ Precision/Recall/F1 for SCS, CBM, ESA. Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended Semantic Web Conference, (May 2013).
  18. 18. db:Astrono- mical Objects db:Astronomy db:Sun  Extracting representative metadata („topic profile“) for each dataset  Ranking of most representative (DBpedia) categories (= topics); applied to all responsive LOD datasets  Scalability vs representativeness: sampling & ranking for good scalability/accuracy balance DBpedia category graph Stefan Dietze 27/03/14 Dataset profiling: what‘s the data about? A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference ,(ESWC2014), Crete, Greece, (2014). <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Programme
  19. 19. Dataset profiling: approach Stefan Dietze 27/03/14 1. Sampling of resource instances (random sampling, weighted sampling, resource centrality sampling) 2. Entity and topic extraction (NER via DBpedia Spotlight, category mapping and expansion) 3. Normalisation and ranking (using graphical- models such as PageRank with Priors, HITS with Priors and K-Step Markov) => Result: weighted dataset-topic profile graph A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
  20. 20. Dataset profiling: exploring LOD datasets/topics in a nutshell http://data-observatory.org/lod-profiles/ Stefan Dietze 27/03/14  Automatic extraction of dataset “topics” [ESWC2014]  Visualisation & exploration of dataset-topic graph (datasets, topics, relationships)  Includes all (responsive) datasets of LOD Cloud
  21. 21. Dataset profiling: results evaluation Stefan Dietze 27/03/14 NDCG (averaged over all datasets) . Datasets & Ground Truth  Yovisto, Oxpoints, LAK Dataset, Semantic Web Dogfood  Crowd-sourced topic indicators from datasets (keywords, tags)  Manual mapping to entities & category extraction (ranking according to frequency) Baselines  1) LDA, 2) tf/idf (applied to entire datasets)  Topic extraction according to our approach, weighting/ranking based on term weight Measure  NDCG @ rank l  Performance (time/NDCG) for different sampling strategies/sizes etc
  22. 22. Stefan Dietze 27/03/14 dbp:Category:Royal_Medal_winners dbp:Category:1955_births dbp:Category:People_from_London dbp:Category:Buzzwords dbp:Category:Web_Services dbp:Category:HTTP dbp:Category:Unitarian_Universalists dbp:Category:World_Wide_Web What have these categories in common?
  23. 23. Stefan Dietze 27/03/14 Diversity of category profile for a single paper Berners-Lee, Tim; Hendler, James, Ora Lassila (2001). "The Semantic Web". Scientific American Magazine. person document dbp:Tim_Berners-Lee dbp:Category:1955_births dbp:Category:People_from_London dbp:Category:Buzzwords dbp:Semantic_Web dbp:Category:Semantic_Web dbp:Category:Web_Services dbp:Category:HTTP dbp:Category:Unitarian_Universalists first-level categories (dcterms:subject) dbp:Category:World_Wide_Web dbp:Category:Royal_Medal_winners
  24. 24.  DBpedia category graph not an ideal “topic” vocabulary:  Broad and noisy  “Categories” vs “topics” (for capturing disciplines, thesauri like UMBEL or UNESCO Thesaurus seem better suited)  Hierarchy ?  Filtering of certain partitions of category graph (too generic categories etc)  Mixing categories across resource types (document, person) creates “perceived noise”  But: broadness is useful as general vocabulary for categorisation of all sorts of resource types Stefan Dietze 27/03/14 Dataset profiling: some lessons learned
  25. 25. Stefan Dietze 27/03/14 http://data-observatory.org/led-explorer/  Type specific views on datasets/ categories  “Document” (foaf:document)  “Person “ (foaf:person)  “Course” (aaiso:course)  Currently applied to datasets in LinkedUp Catalog only (as schema mappings already available here) Type-specific exploration of dataset categories
  26. 26. Stefan Dietze 27/03/14 Dataset interlinking recommendation Candidate datasets for interlinking? 34 t Linkset1 Linkset2 Problem  Given dataset t, ranking datasets from D according to probability score (di, t) to contain linking candidates (entities)  Features:  Vocabulary overlap  Existing links (SNA)  Datasets more likely to contain linking candidates if they (a) share common schema elements, or (b) already link to t or datasets t links to (friend of a friend) Conclusions  Roughly 60% MAP for both approaches  Future work: quantity of links, more remote links, extraction of dataset links rather than data from DataHub Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A., Dietze, S., Recommending Tripleset Interlinking through a Social Network Approach, The 14th International Conference on Web Information System Engineering (WISE 2013), Nanjing, China, 2013. Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova, M.A., Dietze, S., Identifying candidate datasets for data interlinking, in Proceedings of the 13th International Conference on Web Engineering, (2013). Rank 1 DBLP 2 ACM 3 OAI 4 CiteSeer 5 IBM 6 Roma 7 IEEE 8 Ulm 9 Pisa ? ?
  27. 27. Stefan Dietze 27/03/14 37 Success models: data & applications  LinkedUp Challenge to identify innovative tools & applications  Evaluation methods and approaches “LinkedUp” – Linking Web Data (for Education) L Data linking & curation Technology transfer & community-building  Collecting & exposing open data => LinkedUp Data Catalog  Profiling and linking of Web Data for education => educational data graph [ESWC2013], [ISWC2013],  Disseminating knowledge & building communities (educators, computer scientists, data engineers)  Gathering stakeholder feedback: use cases, and requirements http://linkedup-challenge.org/#usecases http://linkedup-project.eu/events http://www.linkedup-challenge.org/ http://data.linkededucation.org European suport action to advance take-up of open data & related technologies http://www.linkedup-project.eu
  28. 28. Stefan Dietze 27/03/14 17/09/2013 38 Who we areL LinkedUp Network LinkedUp Consortium LinkedUp Advisory Board
  29. 29. LinkedUp Challenge: using open data (for learning)  Open Data Competition to promote tools and applications that analyse / integrate (Linked) Web data  Organised by LinkedUp project over 2 years (“Veni”, “Vidi”, “Vici”) with 40.000 EUR awards  Veni Competition - 22 submissions, 8 shortlisted for presentation at Open Knowledge Conference (17 September, Geneva Switzerland) http://linkedup-challenge.org Stefan Dietze 27/03/14
  30. 30.  Open & focused track(s)  Final events at ESWC2014 (May, Crete)  Open Track only  Final events at OKCon 2013 (September 2013, Geneva)  Open track & focused tracks  Submission details and calls to be released soon  Final events at ISWC2014 (October, Riva del Garda, Italy) May –September 2013 October 2013 – May 2014 May 2014 – October 2014 ?
  31. 31. The Veni shortlist & winners DataConf. KnowNodes Mismuseos ReCredible YourHistory 27/03/14 http://www.globe-town.org/ WeShare - 3rd price / people‘s choice GlobeTown - 2nd price http://seek.cloud.gsic.tel.uva.es/weshare/ http://www.polimedia.nl/ PoliMedia – 1st price
  32. 32. data.l3s.de – a DataHub for the L3S
  33. 33. Learning Analytics & Knowledge Dataset & Challenge Facilitating Research on Learning Analytics and EDM a nutshell Stefan Dietze 27/03/14 http://lak.linkededucation.org/ http://lak.linkededucation.org/ LAK Dataset (450 publications in RDF/R)  ACM International Conference on Learning Analytics and Knowledge (LAK) (2011-13)  International Conference on Educational Data Mining (2008-13)  Journal of Educational Data Mining (2008-12) LAK Data Challenge  Analyse, explore correlate the LAK Dataset  At ACM LAK 2014 (April 2014, Indianapolis)
  34. 34. KEYSTONE COST ACTION 27/03/14 51Stefan Dietze http://www.keystone-cost.eu/  Research network focused on distributed search, dataset profiling, to Semantic Web, Databases, etc.  Running 2013-2017  WG1: Representation of structured data sources  WG2: Keyword search  WG3: User interaction and query interpretation  WG4: Research integration, showcases, benchmarks, and evaluations  Open to new members (even beyond Europe)  Joint workshops (eg PROFILES2014 @ ESWC2014)
  35. 35. Ongoing/future work … and some upcoming events Linked Data evolution, preservation, consistency  In RDF graphs (eg LOD Cloud), „all“ nodes are connected  LD preservation: which datasets to preserve (direct links or even more distant neighbours)? => semantic relatedness as guidance for scalable preservation strategies /data enrichment  Link correctness in evolving LD  Investigating impact of changes on link correctness (weekly LOD crawls over 1 year time span)  Application: informed preservation strategies  Conflict detection and LD quality (link quality, impact of conflicts in distant nodes)  PROFILES workshop @ ESWC2014 (http://keystone-cost.eu/profiles2014)  26 May 2014, Crete, Greece  Linking User Data 2014 at UMAP2014 (http://liud.linkededucation.org)  Deadline: 1 April  Online Learning & LD Tutorial at WWW2014 (http://www2014.kr/)  07 April, Seoul
  36. 36. Thank you! WWW See also (general)  http://linkedup-project.eu  http://linkededucation.org  http://data.l3s.de http://purl.org/dietze See also (data)  http://data.linkededucation.org  http://data.linkededucation.org/linkedup/catalog/  http://lak.linkededucation.org 27/03/14 54Stefan Dietze  Besnik Fetahu (L3S)  Bernardo Pereira Nunes (PUC Rio)  Marco Casanova (PUC Rio)  Luiz Andre Paes Leme (PUC Rio)  Giseli Lopes (PUC Rio)  Davide Taibi (CNR, IT)  Mathieu d’Aquin (Open University, UK)  and many more… Acknowledgements

×