This document discusses profiling and interlinking web datasets. It covers recent work on exploring, discovering, and searching linked data through entity and dataset interlinking recommendations and dataset profiling. It also discusses research areas like web science, information retrieval, and semantic web technologies. Some specific projects are mentioned for dataset profiling, entity linking, and generating structured topic profiles for datasets. Challenges around semantics, schemas, data consistency, and disambiguating entities are also outlined.
Linked Data for Federation of OER Data & Repositories - Stefan Dietze
An overview of the alternatives and opportunities for using Linked Data principles and datasets for federated access to distributed OER repositories. The talk was held at the ARIADNE/GLOBE convening (http://ariadne-eu.org/content/open-federations-2013-open-knowledge-sharing-education) at LAK 2013, Leuven, Belgium, on 8 April 2013.
Collaboration to Curation: The High Rise Project meets Edinburgh DataShare - University of Edinburgh
Slides describing the evolution of the Edinburgh DataShare repository and The High-Rise Project and the (potential) collaborative mechanisms that will enable the digital content to be ingested and preserved in the Edinburgh DataShare DSpace repository environment
Presented by Tony Mathys at a Current Issues and Applications of the Geospatial Technologies Lecture, Department of Geography and Environment, Aberdeen University, 24 February 2012
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA) - EUDAT
EUDAT and PRACE joined forces to help research communities gain access to high-quality managed e-Infrastructures whose resources can be connected together to enable cross-utilization use cases and be made accessible without any technical barrier. The capability to couple data and compute resources is considered one of the key factors in accelerating scientific innovation and advancing research frontiers. The goal of this session was to present the EUDAT services and the results of the collaboration achieved so far, and to deliver a hands-on session on how to write a Data Management Plan (DMP). The DMP is a useful instrument for researchers to reflect on and communicate about the way they will deal with their data. It prompts them to think about how they will generate, analyse and share data during their research project and afterwards.
Visit: https://www.eudat.eu/eudat-summer-school
Slides of my talk at OSLCfest in Stockholm Nov 6, 2019
Video recording of the talk is available here:
https://www.facebook.com/oslcfest/videos/2261640397437958/
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,... - European Data Forum
Selected Talk by Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center, Austria at the European Data Forum 2014, 19 March 2014 in Athens, Greece: CODE - Linked Data in Context: Questions Matter
Data management plans – EUDAT Best practices and case study | www.eudat.eu - EUDAT
Presentation given by Stéphane Coutin during the PRACE 2017 Spring School joint training event with the EU H2020 VI-SEEM project (https://vi-seem.eu/) organised by CaSToRC at The Cyprus Institute. Science, and more specifically projects using HPC, is facing a digital data explosion. Instruments and simulations are producing more and more volume; data can be shared, mined, cited, preserved… They are a great asset, but they are facing risks: we can run short of storage, we can lose them, they can be misused… To start this session, we will review why it is important to manage research data and how to do this by maintaining a Data Management Plan. This will be based on the best practices from the EUDAT H2020 project and European Commission recommendations. During the second part we will interactively draft a DMP for a given use case.
Interpreting Data Mining Results with Linked Data for Learning Analytics - Mathieu d'Aquin
Interpreting Data Mining Results with Linked Data for Learning Analytics: Motivation, Case Study and Directions
Presentation at the LAK 2013 conference - 10-04-2013
Towards an Open Research Knowledge Graph - Sören Auer
The document-oriented workflows in science have reached (or already exceeded) the limits of adequacy as highlighted for example by recent discussions on the increasing proliferation of scientific literature and the reproducibility crisis. Now it is possible to rethink this dominant paradigm of document-centered knowledge exchange and transform it into knowledge-based information flows by representing and expressing knowledge through semantically rich, interlinked knowledge graphs. The core of the establishment of knowledge-based information flows is the creation and evolution of information models for the establishment of a common understanding of data and information between the various stakeholders as well as the integration of these technologies into the infrastructure and processes of search and knowledge exchange in the research library of the future. By integrating these information models into existing and new research infrastructure services, the information structures that are currently still implicit and deeply hidden in documents can be made explicit and directly usable. This has the potential to revolutionize scientific work because information and research results can be seamlessly interlinked with each other and better mapped to complex information needs. Also research results become directly comparable and easier to reuse.
Long-term data curation, aka data preservation - EUDAT Summer School (Marjan ... - EUDAT
Marjan will give an overview of the role of data archives in ensuring the safe stewardship and preservation of data over time. She will explain what it means to be a Trustworthy Digital Repository and the associated policies and processes that need to be in place to ensure data provenance and authenticity. This session will link to Monday’s exploration of the re3data.org portal.
Visit: https://www.eudat.eu/eudat-summer-school
The paper presents a literature review on the long-term preservation of 3D architectural building data. The review identified the existing gap in the research and practice of the long-term preservation of 3D architectural models and suggested future research opportunities in this domain.
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles - Besnik Fetahu
The increasing adoption of Linked Data principles has led to an abundance of datasets on the Web. However, take-up and reuse are hindered by the lack of descriptive information about the nature of the data, such as their topic coverage, dynamics or evolution. To address this issue, we propose an approach for creating linked dataset profiles. A profile consists of structured dataset metadata describing topics and their relevance. Profiles are generated through the configuration of techniques for resource sampling from datasets, topic extraction from reference datasets and their ranking based on graphical models. To enable a good trade-off between scalability and accuracy of generated profiles, appropriate parameters are determined experimentally. Our evaluation considers topic profiles for all accessible datasets from the Linked Open Data cloud. The results show that our approach generates accurate profiles even with comparatively small sample sizes (10%) and outperforms established topic modelling approaches.
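The sample-extract-rank pipeline described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the function name `build_topic_profile` and the caller-supplied `extract_topics` callback are hypothetical, and ranking by normalised frequency is a simple stand-in for the graphical-model ranking the abstract mentions.

```python
import random
from collections import Counter


def build_topic_profile(resources, extract_topics, sample_fraction=0.1, seed=0):
    """Return (topic, relevance) pairs for a dataset, most relevant first.

    resources       -- sequence of resource descriptions (any objects)
    extract_topics  -- hypothetical callback mapping one resource to its topics
    sample_fraction -- share of resources to sample (10% as in the abstract)
    """
    resources = list(resources)
    if not resources:
        return []
    # Step 1: sample a fraction of the resources (at least one).
    size = max(1, round(len(resources) * sample_fraction))
    sample = random.Random(seed).sample(resources, size)
    # Step 2: extract topics, counting each topic once per resource.
    counts = Counter(t for res in sample for t in set(extract_topics(res)))
    # Step 3: rank by the share of sampled resources mentioning the topic
    # (an assumed stand-in for the graphical-model ranking).
    return [(topic, n / size) for topic, n in counts.most_common()]
```

Passing `sample_fraction=1.0` profiles the whole dataset, which gives a baseline for judging how much accuracy a smaller sample loses.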
This German-language presentation was given at the 8th "Wildauer Bibliothekssymposium" in Wildau, Germany. It introduces the audience to the EU-funded research project DURAARK and gives an insight into the first achieved goals and the next steps concerning the preservation of three-dimensional architectural data.
DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L... - lindlar
Presentation of the DURAARK project at the final seminar of the DEDICATE project ("Design's Digital Curation for Architecture") held in Glasgow on October 21st, 2013.
http://architecturedigitalcuration.blogspot.de/
Quality criteria for architectural 3D data in usage and preservation processes - lindlar
Quality assessment of digital material has been just one of the new tasks the digital revolution brought into the library domain. With the first big print material digitization efforts in the digital heritage domain dating back to the 1980s, plenty of experience has been gathered and recommendations on best practice published. Along the same lines, libraries today often publish guidelines on formats or quality parameters for digital textual materials which enter their holdings.
While digital texts such as e-journals are in common use today, non-textual materials of various domains are just entering the holdings of cultural heritage institutions. An example of this is architectural data, which is of interest to a variety of libraries and archives, ranging from special collection libraries, such as the RIBA Library of the Royal Institute of British Architects, to national archives responsible for archiving information about publicly funded buildings. Architectural practice today commonly includes 3D object processing. The output of these processes is slowly reaching the aforementioned cultural heritage institutions, which are now facing the task of quality assessment of the material.
The presentation will give a first analysis of potential quality factors and compare architectural and cultural heritage domain expectations of 3D data quality. It will look at two forms of 3D data: modelled 3D objects and scanned 3D objects. The work presented is based on work conducted in the ongoing EU FP-7 DURAARK project.
DURAARK presentation CIB W78 "Applications of IT in AEC" conference Beijing 2... - Jakob Beetz
Presentation of the DURAARK project (http://duraark.eu/) at the 30th CIB W78 conference on applications of IT in AEC in Beijing, 2013
http://2013cibw78.civil.tsinghua.edu.cn/
This presentation was given at the IGeLU conference in Oxford, UK. It introduces the audience to the EU-funded research project DURAARK and gives an insight into the first achieved goals and the next steps concerning the preservation of three-dimensional architectural data.
This German-language presentation was given at the 19th "Archivierung von Unterlagen aus digitalen Systemen" conference in Vienna, Austria. It introduces the audience to the EU-funded research project DURAARK and gives an insight into the preservation planning of three-dimensional data.
Semantic Linking & Retrieval for Digital Libraries - Stefan Dietze
An overview of recent work on entity linking and retrieval in large corpora, specifically bibliographic data. The work addresses both traditional Linked Data and knowledge graphs as well as data extracted from Web markup, such as the Web Data Commons.
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat... - Stefan Dietze
Presentation from the mentoring event of the Open Education Europa Challenge (http://www.openeducationchallenge.eu/) about using Linked Data in educational applications.
Mining and Understanding Activities and Resources on the Web - Stefan Dietze
Research seminar at KMRC Tübingen, Germany, on mining and understanding Web activities and resources through knowledge discovery and machine learning approaches.
Doing for data what PubMed did for literature: DATS, a model for dataset description, dataset indexing and data discovery.
Google Slides [https://goo.gl/cd5KKa] or SlideShare [https://goo.gl/c8DH5N]
Data integration in a Hadoop-based data lake: A bioinformatics case - IJDKP
When we work in a data lake, data integration is not easy, mainly because the data is usually stored in raw format. Manually performing data integration is a time-consuming task that requires the supervision of a specialist, who can make mistakes or fail to see the optimal point for data integration among two or more datasets. This paper presents a model to perform heterogeneous in-memory data integration in a Hadoop-based data lake based on a top-k set similarity approach. Our main contribution is the process of ingesting, storing, processing, integrating, and visualizing the data integration points. The algorithm for data integration is based on the Overlap coefficient since it presented better results when compared with the set similarity metrics Jaccard, Sørensen-Dice, and the Tversky index. We tested our model by applying it to eight bioinformatics-domain datasets. Our model presents better results when compared to an analysis by a specialist, and we expect that our model can be reused for other domains of datasets.
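The four set similarity metrics named in this abstract have standard textbook definitions, sketched below together with a simple top-k ranking of column pairs. This is an illustration under stated assumptions, not the paper's implementation: the function `top_k_pairs`, the column names and the sample values are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sorensen_dice(a, b):
    """2|A ∩ B| / (|A| + |B|)"""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|)"""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def tversky(a, b, alpha=0.5, beta=0.5):
    """|A ∩ B| / (|A ∩ B| + alpha*|A - B| + beta*|B - A|)"""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def top_k_pairs(columns, k=3, metric=overlap):
    """Rank pairs of columns (name -> set of values) by similarity, best first."""
    scored = [
        (left, right, metric(columns[left], columns[right]))
        for left, right in combinations(sorted(columns), 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:k]


# Hypothetical columns from three datasets.
columns = {
    "genes.symbol": {"TP53", "BRCA1", "EGFR"},
    "proteins.gene": {"TP53", "EGFR"},
    "samples.id": {"S1", "S2"},
}
best = top_k_pairs(columns, k=1)
```

The Overlap coefficient reaches 1.0 whenever the smaller set is fully contained in the larger one, which may be why it suits join-candidate detection between columns of very different sizes. That explanation is an interpretation, not a claim made in the abstract.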
Bringing Machine Learning and Knowledge Graphs Together
Six Core Aspects of Semantic AI:
- Hybrid Approach
- Data Quality
- Data as a Service
- Structured Data Meets Text
- No Black-box
- Towards Self-optimizing Machines
Understanding Scientific and Societal Adoption and Impact of Science Through ... - Stefan Dietze
Keynote on analysing scholarly discourse at the Second International Workshop on Semantic Technologies and Deep Learning Models for Scientific, Technical and Legal Data (SemTech4STLD), held on 26 May at ESWC 2024
AI in between online and offline discourse - and what has ChatGPT to do with ... - Stefan Dietze
Talk at Bonn University on general AI and NLP challenges in the context of online discourse analysis. Specific focus on challenges arising from the widespread adoption of neural large language models.
An interdisciplinary journey with the SAL spaceship – results and challenges ... - Stefan Dietze
Keynote at HELMeTO2022 conference, Palermo, Italy on recent research in Search As Learning (SAL), at the intersection of machine learning and cognitive psychology.
From Web Data to Knowledge: on the Complementarity of Human and Artificial In... - Stefan Dietze
Inaugural lecture at Heinrich-Heine-University Düsseldorf on 28 May 2019.
Abstract:
When searching the Web for information, human knowledge and artificial intelligence are in constant interplay. On the one hand, human online interactions such as click streams, crowd-sourced knowledge graphs, semi-structured web markup or distributional semantic models built from billions of Web documents are informing machine learning and information retrieval models, for instance, as part of the Google search engine. On the other hand, the very same search engines help users in finding relevant documents, facts, or data for particular information needs, thereby helping users to gain knowledge. This talk will give an overview of recent work in both of the aforementioned areas. This includes 1) research on mining structured knowledge graphs of factual knowledge, claims and opinions from heterogeneous Web documents as well as 2) recent work in the field of interactive information retrieval, where supervised models are trained to predict the knowledge (gain) of users during Web search sessions in order to personalise rankings. Both streams of research are converging as part of online platforms and applications to facilitate access to data(sets), information and knowledge.
Analysing User Knowledge, Competence and Learning during Online Activities - Stefan Dietze
Research talk given at Italian National Research Council (CNR), Institute for Educational Technologies (ITD) on learning analytics in everyday online activities.
Analysing & Improving Learning Resources Markup on the Web - Stefan Dietze
Talk at WWW2017 on LRMI adoption, quality and usage. Full paper here: http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/companion/p283.pdf.
What's all the data about? - Linking and Profiling of Linked Datasets
1. What’s all the data about – profiling and interlinking Web datasets
Stefan Dietze
L3S Research Center
27/03/14
2. Introduction
Recent work on Linked Data exploration/discovery/search
Entity interlinking & dataset interlinking recommendation
Dataset profiling
Data consistency & conflicts
Research areas
Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking)
Application domains: education/TEL, Web archiving, …
Some projects
http://www.l3s.de/
See also: http://purl.org/dietze
3. Linked Data is awesome, but...
…why are there so few datasets actually used?
Data reuse and in-links are focused on trusted “reference graphs” such as DBpedia, Freebase, etc.
Long tail of LD datasets which are neither reused nor linked to (LOD Cloud alone: 300+ datasets, 50 bn triples)
Explanations?
“HTTP-accessibility” (SPARQL, URI-dereferencing)
“Structure” & “Semantics” (=> shared/linked vocabularies)
“Interlinked”
“Persistent”
Hm, really?
4. Linked data is more diverse than we think
SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International Semantic Web Conference 2013 (ISWC2013).
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of datasets?
Less than 50% of all SPARQL endpoints are actually responsive at any given point in time
“THE” SPARQL protocol? No, but many variants & subsets
…
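An availability figure like the one above could in principle be measured by probing each endpoint with a trivial query. The sketch below is a hedged illustration only, not the methodology of [Buil-Aranda et al 2013]: the function names, the `ASK` probe query and the 10-second timeout are assumptions, and real endpoints may require POST, other media types or authentication.

```python
import urllib.parse
import urllib.request

# Assumed probe: the cheapest query most endpoints should be able to answer.
ASK_QUERY = "ASK { ?s ?p ?o }"


def build_probe_url(endpoint):
    """Append the probe as a `query` parameter (SPARQL protocol, query via GET)."""
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urllib.parse.urlencode({"query": ASK_QUERY})


def is_responsive(endpoint, timeout=10, opener=urllib.request.urlopen):
    """Return True if the endpoint answers the probe with a 2xx status."""
    request = urllib.request.Request(
        build_probe_url(endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def availability(endpoints, **kwargs):
    """Share of endpoints that responded to the probe."""
    endpoints = list(endpoints)
    if not endpoints:
        return 0.0
    return sum(is_responsive(e, **kwargs) for e in endpoints) / len(endpoints)
```

The `opener` parameter exists only so that the network call can be replaced in tests. Running `availability` repeatedly over a list of endpoint URLs would give an availability-over-time series.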
Shared vocabularies & schemas, but:
…still very heterogeneous [d’Aquin, WebSci13]
…data partially messy and not conformant (RDFS, schemas) [HoganJWS2012]
…even widely used reference datasets such as DBpedia are noisy [Paulheim2013]
Co-occurrence graph of data types in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science, Volume 8218, 2013, pp. 510–525.
An Empirical Survey of Linked Data Conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, pp. 14–44, 2012.
5. What about data consistency?
Inconsistency and Incompleteness of Linked Datasets – a Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., Web Science 2014 (WebSci14), under review.
6. Too many/diverse datasets, too little information
Which datasets are useful & trustworthy for case
XY (eg „learning about the solar system“) ? Which
topics are covered?
Types: which datasets describe statistics, videos,
slides, publications etc?
Currentness, dynamics, accessibility/reliability,
data quantity & quality?
7. Data curation and dataset profiling
Dataset
Catalog/Registry
Catalog of data: classification of
datasets according to resource
types, disciplines/topics, data
quality, accessibility, etc.
Infrastructure for
distributed/federated querying
describes
8. Dataset profiling: what’s all the data about?
Dataset
Metadata
BIBO
AAISO
FOAF
contains
Entity disambiguation &
linking [ESWC13]
Topic profile extraction
[WWW13, ESWC14]
db:Astronomy
db:Astro. Objects
Dataset
Catalog/Registry
yov:Video
po:Programme
BBC Programme
<po:Programme …>
<po:Series>Wonders of the Solar System</.>
<po:Actor>Brian Cox</…>
</po:Programme…>
<yo:Video …>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video…>
Yovisto Video
bibo:Film
Schema mappings
[WebSci13]
10. Schema assessment and mapping
Co-occurrence of data types
(in 146 datasets: 144 vocabularies, 588 highly
overlapping types, 719 properties)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
<po:Programme …>
<po:title>Secret Universe –
The Life of the Cell</po:title>
…
</po:Programme…>
BBC Programme
<sioc:Item …>
<label>Viral diseases &
bacteria</label>
…
</sioc:Item ….>
SlideShare Set
po:Programme
sioc:Item
?
http://datahub.io/group/linked-education
11. Schema assessment and mapping
Co-occurrence of data types
(in 146 datasets: 144 vocabularies, 588 highly
overlapping types, 719 properties)
Co-occurrence after mapping into most frequent schemas
(201 frequent types mapped into 79 classes)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
bibo:Slideshow
bibo:Film
bibo:Document
<po:Programme …>
<po:title>Secret Universe –
The Life of the Cell</po:title>
…
</po:Programme…>
BBC Programme
<sioc:Item …>
<label>Viral diseases &
bacteria</label>
…
</sioc:Item ….>
SlideShare Set
po:Programme
sioc:Item
12. LinkedUp Data Catalog
in a nutshell http://datahub.io/group/linked-education
http://data.linkededucation.org/linkedup/catalog/
RDF (VoID) dataset catalog: browse &
query distributed datasets
Live information about endpoint
accessibility
Federated queries using type mappings
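One way to picture federated queries over type mappings is the sketch below: a query for a shared class is expanded to every dataset-specific type mapped onto it. The mapping table is a hypothetical illustration built from the type names shown on these slides, not the catalog's actual mapping, and the helper names are assumptions.

```python
# Hypothetical mapping from dataset-specific types to shared classes.
TYPE_MAPPINGS = {
    "po:Programme": "bibo:Film",
    "yov:Video": "bibo:Film",
    "sioc:Item": "bibo:Slideshow",
}


def source_types(shared_class, mappings=TYPE_MAPPINGS):
    """List the dataset-specific types mapped onto a shared class."""
    return sorted(t for t, target in mappings.items() if target == shared_class)


def build_query(shared_class, limit=10, mappings=TYPE_MAPPINGS):
    """Build a SPARQL query matching the shared class and all mapped types.

    PREFIX declarations are omitted and would have to be prepended
    before sending the query to an endpoint.
    """
    types = [shared_class] + source_types(shared_class, mappings)
    return (
        "SELECT ?resource ?type WHERE { "
        f"VALUES ?type {{ {' '.join(types)} }} "
        "?resource a ?type } "
        f"LIMIT {limit}"
    )
```

The same generated query can then be sent to each endpoint listed in the catalog, so that a request for films also retrieves resources typed with a dataset's own vocabulary.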
13. <yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
Topics/categories addressed?
Relatedness of resources/entities?
(types, semantics)
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Combining a co-occurrence-based and a semantic measure
for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R.
Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended
Semantic Web Conference, (May 2013).
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B., Dietze, S.,
Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended
Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
Challenge: semantics of resources/datasets?
14. <yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Data disambiguation (for linking & profiling)
Brian Cox?
Sun?
Pluto?
15. db:Pluto (Dwarf Planet)
db:Astronomical Objects
db:Sun
Data disambiguation using background knowledge
„Semantic relatedness“ of resources?
db:Astronomy
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
<yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
16. db:Pluto (Dwarf Planet)
db:Astronomical Objects
<yov:Lecture8748720>
<title>Pluto & the Dwarf
Planets</title>
…
</yov:Lecture8748720>
Online Lecture
db:Astronomy
Computation of connectivity scores
between resources/entities
Method: combination of
(i) a semantic (graph-based) connectivity score (SCS) with
(ii) a Web co-occurrence-based measure (CBM), similar to the
Normalized Google Distance (NGD)
For (i): adaptation of the Katz index from social network analysis (SNA)
to (linked) data graphs, considering the number and lengths of paths
over transversal properties
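To illustrate the graph-based part, the following is a minimal sketch of a truncated Katz index on a toy graph. It counts walks rather than simple paths and ignores property types, so it is a simplification and not the SCS defined in the ESWC 2013 paper. The damping factor `beta`, the cut-off `max_len`, and the toy edges are assumptions.

```python
def katz_score(adj, source, target, beta=0.5, max_len=3):
    """Truncated Katz index: sum over l of beta**l * (#walks of length l)."""
    n = len(adj)
    # walks[i] = number of walks of the current length from source to node i
    walks = [1 if i == source else 0 for i in range(n)]
    score = 0.0
    for length in range(1, max_len + 1):
        # Extend every walk by one edge.
        walks = [sum(walks[k] * adj[k][j] for k in range(n)) for j in range(n)]
        score += (beta ** length) * walks[target]
    return score


# Toy undirected graph (hypothetical edges):
# 0 = db:Pluto, 1 = db:Astronomical_Objects, 2 = db:Astronomy, 3 = db:Sun
ADJ = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
]

pluto_sun = katz_score(ADJ, 0, 3)
```

Shorter connections contribute more because each additional hop is damped by another factor of `beta`; many connections raise the score because every walk adds its own term.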
db:Sun
SCS = 0.32
CBM = 0.24
http://purl.org/vol/doc/
http://purl.org/vol/ns/
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
Entity linking: semantic relatedness
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
17. Entity linking: evaluation
Evaluation based on USA Today news items (80,000 entity pairs)
Manually created gold standard
(1,000 entity pairs)
Baseline: Explicit Semantic Analysis (ESA)
=> CBM/SCS: „relatedness“; ESA: „similarity“
Precision/Recall/F1 for SCS, CBM, ESA.
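For reference, precision, recall and F1 against a gold standard of entity pairs can be computed as in the sketch below. The function name and the representation of pairs as tuples are assumptions; the actual evaluation setup is described in the cited paper.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted entity pairs with gold-standard pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Precision is the share of predicted pairs that are correct, recall the share of gold pairs that were found, and F1 their harmonic mean.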
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
18. db:Astronomical Objects
db:Astronomy
db:Sun
Extracting representative metadata („topic profile“) for each dataset
Ranking of most representative (DBpedia) categories (= topics); applied to all responsive LOD datasets
Scalability vs representativeness: sampling & ranking for good scalability/accuracy balance
DBpedia category graph
Dataset profiling: what’s the data about?
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
19. Dataset profiling: approach
1. Sampling of resource instances
(random sampling, weighted sampling, resource
centrality sampling)
2. Entity and topic extraction (NER via DBpedia
Spotlight, category mapping and expansion)
3. Normalisation and ranking (using graphical
models such as PageRank with Priors, HITS with
Priors and K-Step Markov)
=> Result: weighted dataset-topic profile graph
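The ranking step can be pictured with a small PageRank-with-priors sketch over a dataset, entity and category graph. This is a generic power iteration under stated assumptions (restart probability `beta`, fixed iteration count, toy edges and node names) and not the implementation evaluated in the ESWC 2014 paper.

```python
def pagerank_with_priors(edges, priors, beta=0.3, iterations=50):
    """Rank nodes by a random walk that restarts at the prior nodes."""
    nodes = sorted({n for edge in edges for n in edge} | set(priors))
    total = sum(priors.values())
    if total <= 0:
        raise ValueError("priors must have positive total weight")
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    out = {n: [] for n in nodes}
    for source, target in edges:
        out[source].append(target)
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: beta * prior[n] for n in nodes}
        for n in nodes:
            if out[n]:
                share = (1 - beta) * rank[n] / len(out[n])
                for target in out[n]:
                    nxt[target] += share
            else:
                # Dangling node: hand its mass back to the priors.
                for m in nodes:
                    nxt[m] += (1 - beta) * rank[n] * prior[m]
        rank = nxt
    return rank


# Toy graph (hypothetical): dataset -> extracted entities -> categories
EDGES = [
    ("dataset", "db:Sun"),
    ("dataset", "db:Pluto"),
    ("db:Sun", "cat:Astronomy"),
    ("db:Sun", "cat:Astronomical_objects"),
    ("db:Pluto", "cat:Astronomical_objects"),
]
scores = pagerank_with_priors(EDGES, {"dataset": 1.0})
```

Placing all prior weight on the dataset node makes the resulting category scores specific to that dataset, which is the idea behind a weighted dataset-topic profile.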
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
20. Dataset profiling: exploring LOD datasets/topics
in a nutshell http://data-observatory.org/lod-profiles/
Automatic extraction of dataset “topics” [ESWC2014]
Visualisation & exploration of dataset-topic graph
(datasets, topics, relationships)
Includes all (responsive) datasets of LOD Cloud
21. Dataset profiling: results evaluation
NDCG (averaged over all datasets)
Datasets & Ground Truth
Yovisto, Oxpoints, LAK Dataset, Semantic Web
Dogfood
Crowd-sourced topic indicators from datasets
(keywords, tags)
Manual mapping to entities & category extraction
(ranking according to frequency)
Baselines
1) LDA, 2) tf/idf (applied to entire datasets)
Topic extraction according to our approach,
weighting/ranking based on term weight
Measure
NDCG @ rank l
Performance (time/NDCG) for different sampling
strategies/sizes etc
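As a reminder of the measure, NDCG at rank l can be computed as in the following sketch. It uses the common log2 discount and derives the ideal ordering from the same list of graded relevances, which assumes that list contains all relevant topics; the function names are illustrative.

```python
import math


def dcg(relevances, l):
    """Discounted cumulative gain over the top-l graded relevances."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:l]))


def ndcg(relevances, l):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), l)
    return dcg(relevances, l) / ideal if ideal else 0.0
```

A ranking that already lists topics in descending relevance scores 1.0; misplaced high-relevance topics lower the score more when they appear further down.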
23. Diversity of category profile for a single paper
Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web".
Scientific American Magazine.
person
document
dbp:Tim_Berners-Lee
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
dbp:Semantic_Web
dbp:Category:Semantic_Web
dbp:Category:Web_Services
dbp:Category:HTTP
dbp:Category:Unitarian_Universalists
first-level categories (dcterms:subject)
dbp:Category:World_Wide_Web
dbp:Category:Royal_Medal_winners
24. DBpedia category graph not an ideal “topic” vocabulary:
Broad and noisy
“Categories” vs “topics” (for capturing disciplines, thesauri
like UMBEL or UNESCO Thesaurus seem better suited)
Hierarchy ?
Filtering of certain partitions of category graph (too generic
categories etc)
Mixing categories across resource types (document, person)
creates “perceived noise”
But: broadness is useful as general vocabulary for
categorisation of all sorts of resource types
Dataset profiling: some lessons learned
25. Type-specific exploration of dataset categories
http://data-observatory.org/led-explorer/
Type-specific views on datasets/categories:
“Document” (foaf:Document)
“Person” (foaf:Person)
“Course” (aaiso:course)
Currently applied to datasets in the LinkedUp Catalog only
(as schema mappings are already available there)
26. Dataset interlinking recommendation
Candidate datasets for interlinking?
t
Linkset1
Linkset2
Problem
Given a dataset t, rank the datasets d_i in D
by the probability score(d_i, t) that they
contain linking candidates (entities) for t
Features:
Vocabulary overlap
Existing links (SNA)
Datasets more likely to contain linking
candidates if they (a) share common
schema elements, or (b) already link to t
or datasets t links to (friend of a friend)
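The two feature families can be combined in a toy ranking function such as the one below. It is a deliberately simplified sketch: Jaccard overlap of schema terms plus a binary friend-of-a-friend signal, mixed with an arbitrary `weight`. The cited papers use their own models; all names and data here are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets of schema terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(target, vocabularies, links, weight=0.5):
    """Rank datasets by vocabulary overlap and existing links.

    vocabularies: dataset -> set of schema terms it uses
    links: dataset -> set of datasets it already links to
    """
    linked_from_target = links.get(target, set())
    scores = {}
    for dataset, terms in vocabularies.items():
        if dataset == target:
            continue
        overlap = jaccard(terms, vocabularies[target])
        outgoing = links.get(dataset, set())
        # (b) already links to t, or to a dataset that t links to
        social = 1.0 if target in outgoing or outgoing & linked_from_target else 0.0
        scores[dataset] = weight * overlap + (1 - weight) * social
    return sorted(scores, key=lambda d: (-scores[d], d))


VOCABULARIES = {
    "t": {"foaf:Person", "bibo:Document"},
    "DBLP": {"foaf:Person", "bibo:Document", "dc:title"},
    "ACM": {"bibo:Document"},
    "Geo": {"geo:Point"},
}
LINKS = {"t": {"ACM"}, "DBLP": {"ACM"}}
ranking = rank_candidates("t", VOCABULARIES, LINKS)
```

In this toy data DBLP ranks first because it shares most schema terms with t and links to a dataset that t also links to.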
Conclusions
Roughly 60% MAP for both approaches
Future work: quantity of links, more
remote links, extraction of dataset links
rather than data from DataHub
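For readers unfamiliar with the measure quoted above, mean average precision (MAP) over several ranked recommendation lists can be computed as sketched here. Function names and the example datasets are illustrative assumptions.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at each relevant item's rank."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A MAP of roughly 60% therefore means that, averaged over target datasets, relevant candidates tend to appear near the top of the ranking but are interleaved with some irrelevant ones.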
Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A.,
Dietze, S., Recommending Tripleset Interlinking through a
Social Network Approach, The 14th International Conference
on Web Information System Engineering (WISE 2013),
Nanjing, China, 2013.
Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova,
M.A., Dietze, S., Identifying candidate datasets for data
interlinking, in Proceedings of the 13th International
Conference on Web Engineering, (2013).
Rank
1 DBLP
2 ACM
3 OAI
4 CiteSeer
5 IBM
6 Roma
7 IEEE
8 Ulm
9 Pisa
27. “LinkedUp” – Linking Web Data (for Education)
Success models: data & applications
LinkedUp Challenge to identify innovative tools & applications
Evaluation methods and approaches
Data linking & curation
Technology transfer
& community-building
Collecting & exposing open
data
=> LinkedUp Data Catalog
Profiling and linking of Web
Data for education
=> educational data graph
[ESWC2013], [ISWC2013],
Disseminating knowledge &
building communities
(educators, computer
scientists, data engineers)
Gathering stakeholder
feedback: use cases, and
requirements
http://linkedup-challenge.org/#usecases
http://linkedup-project.eu/events
http://www.linkedup-challenge.org/
http://data.linkededucation.org
European support action to
advance take-up of open
data & related technologies
http://www.linkedup-project.eu
29. LinkedUp Challenge: using open data (for learning)
Open Data Competition to promote tools and applications that analyse / integrate (Linked)
Web data
Organised by the LinkedUp project over 2 years (“Veni”, “Vidi”, “Vici”) with 40,000 EUR in awards
Veni Competition: 22 submissions, 8 shortlisted for presentation at Open Knowledge
Conference (17 September, Geneva, Switzerland)
http://linkedup-challenge.org
30. May – September 2013: Open track only; final events at OKCon 2013 (September 2013, Geneva)
October 2013 – May 2014: Open & focused track(s); final events at ESWC2014 (May, Crete)
May 2014 – October 2014: Open track & focused tracks; submission details and calls to be
released soon; final events at ISWC2014 (October, Riva del Garda, Italy)
33. Learning Analytics & Knowledge Dataset & Challenge
Facilitating Research on Learning Analytics and EDM
in a nutshell
http://lak.linkededucation.org/
LAK Dataset (450 publications in RDF/R)
ACM International Conference on Learning Analytics and
Knowledge (LAK) (2011-13)
International Conference on Educational Data Mining (2008-13)
Journal of Educational Data Mining (2008-12)
LAK Data Challenge
Analyse, explore, and correlate the LAK Dataset
At ACM LAK 2014 (April 2014, Indianapolis)
34. KEYSTONE COST ACTION
http://www.keystone-cost.eu/
Research network focused on distributed search and
dataset profiling, related to Semantic Web, Databases, etc.
Running 2013-2017
WG1: Representation of structured data sources
WG2: Keyword search
WG3: User interaction and query interpretation
WG4: Research integration, showcases,
benchmarks, and evaluations
Open to new members (even beyond Europe)
Joint workshops (eg PROFILES2014 @ ESWC2014)
35. Ongoing/future work … and some upcoming events
Linked Data evolution, preservation, consistency
In RDF graphs (eg LOD Cloud), „all“ nodes are connected
LD preservation: which datasets to preserve (direct links
or even more distant neighbours)?
=> semantic relatedness as guidance for scalable
preservation strategies /data enrichment
Link correctness in evolving LD
Investigating impact of changes on link correctness
(weekly LOD crawls over 1 year time span)
Application: informed preservation strategies
Conflict detection and LD quality (link quality, impact of
conflicts in distant nodes)
PROFILES workshop @ ESWC2014
(http://keystone-cost.eu/profiles2014)
26 May 2014, Crete, Greece
Linking User Data 2014 at UMAP2014
(http://liud.linkededucation.org)
Deadline: 1 April
Online Learning & LD Tutorial at WWW2014
(http://www2014.kr/)
07 April, Seoul
36. Thank you!
WWW
See also (general)
http://linkedup-project.eu
http://linkededucation.org
http://data.l3s.de
http://purl.org/dietze
See also (data)
http://data.linkededucation.org
http://data.linkededucation.org/linkedup/catalog/
http://lak.linkededucation.org
Acknowledgements
Besnik Fetahu (L3S)
Bernardo Pereira Nunes (PUC Rio)
Marco Casanova (PUC Rio)
Luiz Andre Paes Leme (PUC Rio)
Giseli Lopes (PUC Rio)
Davide Taibi (CNR, IT)
Mathieu d’Aquin (Open University, UK)
and many more…