An overview of recent work on entity linking and retrieval in large corpora, specifically bibliographic data. The works address both traditional Linked Data and knowledge graphs as well as data extracted from Web markup, such as the Web Data Commons.
Libraries, collections, technology: presented at Pennsylvania State University... (lisld)
Library collections are changing in a network environment. This presentation considers how collections are being reconfigured, looks at research support services, and explores the shift from the purchased/licensed collection to the facilitated collection.
Of Libraries and Labs: Effecting User-Driven Innovation (Alex Humphreys)
JSTOR has launched a new Labs team charged with partnering with libraries and scholars to build innovative tools for research and teaching. The JSTOR Labs team has successfully used ‘flash builds’ – high-intensity, short-burst, user-driven development efforts – in order to bring an idea from conception to a working, user-delighting prototype in as little as a week. In this talk the presenter will describe the approach to flash builds, highlight the partnerships, skills, tools and content that help to innovate, and suggest ways that libraries can adopt these methods to support innovation and the digital humanities.
The document summarizes a project by JSTOR Labs to create an online tool called Understanding Shakespeare that connects quotes from Shakespeare's plays found in scholarly articles on JSTOR to the passages in the Folger Digital Texts. It details how the tool was developed through a partnership between JSTOR and the Folger Shakespeare Library. It also describes the matching algorithm used and provides usage statistics and information on the open API. It concludes by suggesting potential applications of the matching technology to other texts and corpora.
OCLC and the Social Web: Building tools, providing platforms, engaging the co... (Andy Havens)
OCLC is the world's largest library cooperative, established in 1967 to reduce costs and increase access to information. It maintains WorldCat, the world's largest database of library records, and provides interlibrary loan and other services to over 71,000 libraries. OCLC is building social features into WorldCat and developing applications for platforms like Facebook to engage users. It also operates blogs and discussion lists to connect with the library community and shares reports and data to further its mission.
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation (ekansa)
This presentation discusses how a model of “data sharing as publishing” can contribute to developing Linked Open Data resources in archaeology and the study of the ancient world. The paper gives examples from Open Context’s developing approach to data editing, documentation and quality improvement processes. The goal of these efforts is to better align the professional interests of individual researchers with the needs of the larger community to access and use high-quality data in Linked Data scenarios.
Exploring a world of networked information built from free-text metadata (Shenghui Wang)
This document summarizes a presentation about exploring topics through networked information extracted from free-text metadata. It describes challenges in exploring topics and related aspects. It then demonstrates an online interface called Ariadne that addresses these challenges by generating semantic representations of entities from a large dataset and identifying nearest neighbors and related entities through multidimensional scaling. Finally, it discusses potential applications of this approach and references related work.
Towards collaboration at scale: Libraries, the social and the technical (lisld)
Libraries are now supporting research and learning behaviors in data rich network environments. This presentation looks at some examples focusing on how an emphasis on individual systems needs to give way to a broader view of process, workflow and behaviors.
It also discusses how this environment creates a demand for collaboration at scale among libraries.
Collection directions - towards collective collections (lisld)
How the emergence of new research and learning workflows in digital environments is affecting library collecting and collections. Several trends are reviewed. In the light of diversifying competing requirements, the need to manage down print and develop shared print responses is discussed.
Presentation to OCLC Asia Pacific Regional Council meeting. 13 Oct. 2014.
We used to think of the user in the life of the library. Now we think of the library in the life of the user. As behaviors change in a network environment, we have seen growing interest in ethnographic and user-centered design approaches. This presentation introduces this topic. It also explores changes in how we manage collections as an illustration of this shift towards thinking of the library in the life of the user.
Library futures: converging and diverging directions for public and academic ... (lisld)
The major influence on library futures is the changing character of their user communities. As patterns of research, learning and personal development change in a network environment so library services need to change. At the same time, libraries are focused on engaging with their communities more strongly - getting into their work and learning flows. This means that libraries are becoming more unlike each other, they are diverging as they meet the specific needs of their communities. Research libraries diverge from academic libraries, and each is different from urban public libraries, and so on.
At the same time, at a broader level libraries are experiencing similar pressures. The need to engage more strongly with their communities. The need to assess what they do. The need to configure space around experiences rather than around collections. Libraries are converging around some of these issues.
This presentation will consider the future of libraries from the point of view of convergence and divergence between types of libraries.
Full Spectrum Stewardship of the Scholarly Record by Brian E. C. Schottlaende... (Charleston Conference)
Brian Schottlaender discusses the full-spectrum stewardship of the scholarly record. He defines the spectrum as a continuum ranging from stable, established scholarly outputs like journal articles and archives, to less stable outputs like blogs and data. Libraries have historically played a role in curating and preserving the stable portions of the record. However, the digital environment has expanded the types of scholarly resources and introduced new challenges around their long-term management. Effective stewardship of the entire spectrum requires partnerships across different stakeholders and institutions.
The Library in the Life of the User: Two Collection Directions (lisld)
Our understanding of library collections is changing in a digital, network environment. This presentation focuses on two trends in this context. First, the inside-out library is a trend which sees libraries support the creation, management and discoverability of institutional materials: research data, expertise, preprints, and so on. Second, the facilitated collection is a trend which sees libraries increasingly organize resources around user interests, whether these resources are external, collaborative or locally acquired.
This presentation was given at 'The transformation of academic library collecting: a symposium inspired by Dan C. Hazen'. Harvard Library, 20/21 Oct. 2016
The facilitated collection: collections and collecting in a network environment (lisld)
We often think of collections as local – whether owned or licensed. Increasingly this picture is changing in several ways. Libraries are sharing responsibility for collections. Libraries are providing access to materials they do not own, but which are available to their users (freely available digital book collections for example). Demand-driven acquisition changes the view of local collections. Institutions are also thinking about how to manage locally produced materials (research data for example) and support access across institutions. This trend is supported by changes as discovery is peeled away from local collections. This presentation discusses these trends and how collections and discovery change in a network environment.
This was a presentation at the Libraries Australia Forum, Melbourne, 2015
Social metadata for libraries, archives and museums: Research findings from t... (Rose Holley)
The document summarizes the findings of the RLG Social Metadata Working Group regarding the use of social metadata in libraries, archives, and museums. The working group reviewed 76 relevant websites, surveyed 42 site managers, and developed 18 recommendations. Key recommendations include having clear objectives for social media use, establishing guidelines for staff and user-generated content, preparing staff, and continuously evaluating usability. The full report provides analysis of survey results, case studies of third-party site use, and additional recommendations.
Collections unbound: collection directions and the RLUK collective collection (lisld)
A presentation given to RLUK Members' meeting at the University of Warwick.
The library identity has been closely bound with its collection. However, this is changing as research and learning behaviours evolve in a network environment. There are three interesting trends. First, attention is shifting from a library-centric view of a locally owned collection to a user-centred view of a facilitated collection in places where the library can add value. Second, there is growing emphasis on support for creation, for the process of research, as well as for the products, the article or book. And third, we are seeing a changing perspective on the historic core, the print book collection. Increasingly, this is being seen in collective ways as institutions manage down print, or think about its management in cooperative settings, or retire collections as space is reconfigured around research and learning experiences. This presentation also provides preliminary findings for the analysis being carried out by OCLC Research of the RLUK collective collection.
Library collections and the emerging scholarly record (lisld)
A high level review of collection trends followed by a summary of recent work on the evolving scholarly record.
Presented at the OCLC Research Library Partnership meeting at the University of Melbourne, 2 December 2015.
Working collaboratively: scaling infrastructure, services, learning and innov... (lisld)
1. The document discusses collaborative activities in libraries, identifying three main areas: shared service infrastructure, cooperative negotiation and licensing, and professional development and networking.
2. It analyzes libraries through the lenses of an organizational perspective focused on infrastructure, engagement, and innovation, and a service configuration perspective oriented around collections, space, services, and support for student success and research.
3. The key is finding the right scale for collaborative activities to increase engagement, leverage infrastructure, and scale learning and innovation to support the evolving role of libraries.
Libraries: technology as artifact and technology in practice (lisld)
Research and learning workflows are increasingly enacted in data-rich network environments. New behaviors are emerging which are shaped by and in turn shape workflow and data tools and services. This means that library attention is shifting from only providing support systems and services to supporting those behaviors more directly as they emerge. This support may take the form of particular systems or services, but will also involve consulting and advising about such things as publication venues, reputation management, profiles, and research networking.
A keynote presentation given at the Association of Jesuit Colleges and Universities CITM and Library Deans meeting. Loyola University, Maryland.
Presentation at EMTACL10, http://www.ntnu.no/ub/emtacl/
Guus van den Brekel
Central medical library, UMCG
Virtual Research Networks: towards Research 2.0
In the next few years, the further development of social, educational and research networks – with their extensive collaborative possibilities – will dictate how users search for, manage and exchange information. The network – evolved by technology – is changing the user's behaviour and that will affect the future of information services. Many envision a possible leading role for libraries in collaboration and community building services.
Users are not only heavily using new tools, but are also creating and shaping their own preferred tools.
Today's students are incorporating Web 2.0 skills in daily life, in their social and learning environments.
Tomorrow's research staff will expect to be able to use their preferred tools and resources within their work environment.
Today's and tomorrow's libraries should support students and staff in the learning and research process by integrating library services and resources into their environments.
Library discovery: past, present and some futures (lisld)
A presentation at the NISO virtual conference on Webscale Discovery Services, 20 November 2013.
Considers some of the issues that have led to the adoption of these services, and some future directions.
Distinguishes between discovery (providing a library destination) and discoverability (making stuff discoverable elsewhere).
Challenges and opportunities for academic libraries (lisld)
Research and learning behaviors are changing in a network environment. What challenges do academic libraries face? What opportunities do they have? A presentation given at a symposium on the future of academic libraries at the Open University.
Describing Theses and Dissertations Using Schema.org (OCLC)
This document summarizes a project that developed an extension of the Schema.org vocabulary to better describe theses, dissertations, and other materials in institutional repositories. The project team modeled repository entities, academic departments, and relationships between classes. They published example RDF data and loaded all records from a university repository as RDF descriptions. Their work aims to make repository content more visible to search engines and help libraries demonstrate their value on the semantic web.
This presentation was given at Bobcatsss2013 in Ankara.
Once the library assembled a collection and people came to the library to use it. Now, people build communication, workflows and behaviors around a variety of network resources. The library needs to think about how it is visible and relevant in those workflows and behaviors.
Presented at Industry Symposium, IFLA, 14 August 2008. Describes a new environment of global information services using metadata, taxonomies, and knowledge organization. Makes the case that these changes will permanently affect what it means "to catalog" materials for the purpose of connecting citizens, students and scholars to the information they need, when and where they need it.
The document discusses how linked open data and semantic web technologies can be applied to educational data and resources on the web. It provides examples of projects that aim to expose, interlink, and enrich educational datasets using these technologies. The goal is to improve data sharing and interoperability, facilitate reuse of open educational resources, and leverage linked data as a knowledge base to support learning and education.
Retrieval, Crawling and Fusion of Entity-centric Data on the Web (Stefan Dietze)
Stefan Dietze gave a keynote presentation covering three main topics:
1) Challenges in entity retrieval from heterogeneous linked datasets and knowledge graphs due to diversity and lack of standardization.
2) Approaches for enabling discovery and search through dataset recommendation, profiling, and entity retrieval methods that cluster entities to address link sparsity.
3) Going beyond linked data to exploit semantics embedded in web markup, with case studies in data fusion for entity reconciliation and retrieval.
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web (Stefan Dietze)
This document discusses enabling discovery and search of linked data and knowledge graphs. It presents approaches for dataset recommendation including using vocabulary overlap and existing links between datasets. It also discusses profiling datasets to create topic profiles using entity extraction and ranking techniques. These recommendation and profiling approaches aim to help with discovering relevant datasets and entities for a given topic or task.
WWW2013 Tutorial: Linked Data & Education (Stefan Dietze)
Linked data provides opportunities for sharing educational data on the web in a standardized way. It allows for the integration of heterogeneous educational resources and datasets from different platforms. This can enable new applications like cross-platform recommender systems and exploratory search. However, there are also challenges to address like annotation overhead, performance, and scalability when dealing with large amounts of distributed data.
This document discusses dataset profiling and the LinkedUp data catalog. It describes how LinkedUp profiles 34 educational datasets, including information on their schemas, accessibility, and topic coverage. It also explains the benefits of dataset profiling, such as enabling federated querying and exploratory search over multiple datasets. Finally, it outlines techniques for profiling linked data and applications of the profiles through tools like Cite4Me and the LinkedUp data catalog.
Mining and Understanding Activities and Resources on the Web (Stefan Dietze)
Research seminar at KMRC Tübingen, Germany, on mining and understanding Web activities and resources through knowledge discovery and machine learning approaches.
Web Science Synergies: Exploring Web Knowledge through the Semantic Web (Stefan Dietze)
The document discusses exploring web data and knowledge through the semantic web. It describes how the semantic web adds meaning to data through shared vocabularies and schemas. It also discusses challenges with the large number and diversity of linked open datasets, including issues with accessibility, heterogeneity of schemas, and data quality. It proposes approaches to address these challenges, such as dataset profiling, metadata catalogs, and infrastructure for federated querying.
Turning Data into Knowledge (KESW2014 Keynote) (Stefan Dietze)
The document discusses turning data into knowledge through profiling and interlinking web datasets. It covers recent work on linked data exploration, discovery, and search including entity and dataset interlinking recommendations and dataset profiling. It also discusses ensuring data consistency and resolving conflicts. The document then examines challenges with reusing and interlinking the long tail of linked datasets and issues regarding structure, semantics, interlinking, and persistence of linked data on the web.
Riding the wave - Paradigm shifts in information access (datacite)
The document discusses the paradigm shifts in scientific information access over time from empirical observation to computational simulation. It outlines the challenges libraries now face in providing access to non-textual scientific content like research data and simulations. The document also introduces DataCite, a global consortium that issues digital object identifiers (DOIs) to datasets to help make them accessible, citable, and traceable like scholarly articles.
What's all the data about? - Linking and Profiling of Linked Datasets (Stefan Dietze)
This document discusses profiling and interlinking web datasets. It covers recent work on exploring, discovering, and searching linked data through entity and dataset interlinking recommendations and dataset profiling. It also discusses research areas like web science, information retrieval, and semantic web technologies. Some specific projects are mentioned for dataset profiling, entity linking, and generating structured topic profiles for datasets. Challenges around semantics, schemas, data consistency, and disambiguating entities are also outlined.
Enriching Scholarship 2014 Beyond the Journal Article: Publishing and Citing ... (Natsuko Nicholls)
The document discusses data sharing policies and mandates from various organizations including federal funding agencies in the US and internationally, journals, and a paradigm shift toward more transparent and collaborative research that integrates publications and data. Key points include requirements for data management plans from NIH and NSF, expectations of funding agencies in other countries to maximize access to research data, a journal policy requiring data to be made available, and challenges around measuring the impact of shared data given the lack of common practices and standards for citing data.
LinkedUp - Linked Data Europe Workshop 2014 (Stefan Dietze)
The document discusses the LinkedUp project, which aims to advance the use of open data and linked data technologies in education. Specifically:
1. It describes how linked data can be used to improve data sharing and interpretation across isolated education platforms by facilitating a vision of open education.
2. It outlines plans to collect and expose open education data through a LinkedUp Data Catalog to make diverse datasets more discoverable and useful for learning applications.
3. It summarizes the LinkedUp Challenge competition which promotes tools and applications that analyze and integrate web data, with winners being recognized at various conferences.
Open Data Dialog 2013 - Linked Data in Education (Stefan Dietze)
The document discusses opportunities and challenges of using linked data in education. It begins by outlining how linked data principles can be useful for sharing educational data by providing background knowledge and common standards and vocabularies. However, it notes that currently only a few datasets are actually reused or linked, in part due to heterogeneity in datasets, unreliable metadata, and a lack of links between datasets. The LinkedUp project aims to address these issues by collecting and profiling open educational datasets, generating links between them, and building applications and tools to help utilize the data. Key activities include developing a dataset catalog, generating topic profiles of datasets, running challenges to identify innovative applications, and engaging stakeholders.
This presentation was provided by Chris Erdmann of Library Carpentries and by Judy Ruttenberg of ARL during the NISO virtual conference, Open Data Projects, held on Wednesday, June 13, 2018.
Edinburgh DataShare: Tackling research data in a DSpace institutional repository (Robin Rice)
1) The document discusses Edinburgh DataShare, a data repository at the University of Edinburgh that was established as part of the DISC-UK DataShare project to explore new ways for academics to share research data over the internet.
2) It describes lessons learned from establishing the repository, including that top-down drivers are important for data sharing, and that data libraries can help bridge communication between researchers and repository managers.
3) The document recommends that institutions develop research data policies to clarify rights and responsibilities regarding data sharing and management.
Demo: Profiling & Exploration of Linked Open Data (Stefan Dietze)
This document discusses profiling and exploring linked datasets on the web. It describes the LinkedUp dataset catalog which classifies datasets by type, topic, quality and accessibility. The catalog allows querying across distributed datasets. Topic profiles of datasets are extracted by entity disambiguation and mapping dataset schemas. Visualizations show the relationships between datasets, topics and categories. Lessons learned are that broad categories from DBpedia introduce noise, and type-specific views of datasets can provide more precise topic profiles, as demonstrated in an explorer of educational datasets.
RO-Crate: packaging metadata love notes into FAIR Digital Objects (Carole Goble)
Abstract
slides available at: https://zenodo.org/record/7147703#.Y7agoxXP2F4
The Helmholtz Metadata Collaboration aims to make the research data [and software] produced by Helmholtz Centres FAIR for their own and the wider science community by means of metadata enrichment [1]. Why metadata enrichment and why FAIR? Because the whole scientific enterprise depends on a cycle of finding, exchanging, understanding, validating, reproducing, integrating and reusing research entities across a dispersed community of researchers.
Metadata is not just “a love note to the future” [2], it is a love note to today’s collaborators and peers. Moreover, a FAIR Commons must cater for the metadata of all the entities of research – data, software, workflows, protocols, instruments, geo-spatial locations, specimens, samples, people (as well as traditional articles) – and their interconnectivity. That is a lot of metadata love notes to manage, bundle up and move around. Notes written in different languages at different times by different folks, produced and hosted by different platforms, yet referring to each other, and building an integrated picture of a multi-part and multi-party investigation. We need a crate!
RO-Crate [3] is an open, community-driven, and lightweight approach to packaging research entities along with their metadata in a machine-readable manner. Following key principles - “just enough” and “developer and legacy friendliness” - RO-Crate simplifies the process of making research outputs FAIR while also enhancing research reproducibility and citability. As a self-describing and unbounded “metadata middleware” framework, RO-Crate shows that a little bit of packaging goes a long way to realise the goals of FAIR Digital Objects (FDO) [4], and to not just overcome platform diversity but celebrate it while retaining investigation contextual integrity.
In this talk I will present the why, and how Research Object packaging eases Metadata Collaboration using examples in big data and mixed object exchange, mixed object archiving and publishing, mass citation, and reproducibility. Some examples come from the HMC, others from EOSC, USA and Australia, and from different disciplines.
Metadata is a love note to the future, RO-Crate is the delivery package.
[1] https://helmholtz-metadaten.de/en
[2] Scott, Jason The Metadata Mania, http://ascii.textfiles.com/archives/3181, June 2011
[3] Soiland-Reyes, Stian et al. “Packaging Research Artefacts with RO-Crate”. Data Science, 2022; 5(2):97-138, DOI: 10.3233/DS-210053
[4] De Smedt K, Koureas D, Wittenburg P. “FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units”. Publications. 2020; 8(2):21. https://doi.org/10.3390/publications8020021
Big Data in Learning Analytics - Analytics for Everyday Learning (Stefan Dietze)
This document summarizes Stefan Dietze's presentation on big data in learning analytics. Some key points:
- Learning analytics has traditionally focused on formal learning environments but there is interest in expanding to informal learning online.
- Examples of potential big data sources mentioned include activity streams, social networks, behavioral traces, and large web crawls.
- Challenges include efficiently analyzing large datasets to understand learning resources and detect learning activities without traditional assessments.
- Initial models show potential to predict learner competence from behavioral traces with over 90% accuracy.
Similar to Semantic Linking & Retrieval for Digital Libraries (20)
Understanding Scientific and Societal Adoption and Impact of Science Through ... (Stefan Dietze)
Keynote on analysing scholarly discourse at Second International Workshop on Semantic Technologies and Deep Learning Models for Scientific, Technical and Legal Data SemTech4STLD, held on 26 May at ESWC2024
AI in between online and offline discourse - and what has ChatGPT to do with ... (Stefan Dietze)
Talk at Bonn University on general AI and NLP challenges in the context of online discourse analysis. Specific focus on challenges arising from the widespread adoption of neural large language models.
An interdisciplinary journey with the SAL spaceship – results and challenges ... (Stefan Dietze)
Keynote at HELMeTO2022 conference, Palermo, Italy on recent research in Search As Learning (SAL), at the intersection of machine learning and cognitive psychology.
Research Knowledge Graphs at NFDI4DS & GESIS (Stefan Dietze)
Research Knowledge Graphs (RKGs) can help address challenges in data science like reproducibility and bias by making relationships between scientific resources like data, publications, and methods explicit and machine-interpretable. GESIS is constructing large-scale RKGs using natural language processing and deep learning methods to extract knowledge graphs about software and data usage from millions of publications. These RKGs power semantic search and enable new social science research using datasets like TweetsKB, which contains over 10 billion annotated tweets. The NFDI4DS aims to build a joint RKG by connecting existing RKGs through common standards and identifiers.
Beyond research data infrastructures: exploiting artificial & crowd intellige... (Stefan Dietze)
This document discusses using artificial and crowd intelligence to build research knowledge graphs from online data sources. It describes harvesting metadata about research datasets from open data portals and web pages marked up with schemas like RDFa. Machine learning techniques are used to clean and fuse the harvested metadata into a knowledge graph. The knowledge graph can be queried to provide information about research datasets and related entities. Additional methods are discussed for linking mentions of datasets in scholarly publications to real-world datasets.
From Web Data to Knowledge: on the Complementarity of Human and Artificial In... (Stefan Dietze)
Inaugural lecture at Heinrich-Heine-University Düsseldorf on 28 May 2019.
Abstract:
When searching the Web for information, human knowledge and artificial intelligence are in constant interplay. On the one hand, human online interactions such as click streams, crowd-sourced knowledge graphs, semi-structured web markup or distributional semantic models built from billions of Web documents are informing machine learning and information retrieval models, for instance, as part of the Google search engine. On the other hand, the very same search engines help users in finding relevant documents, facts, or data for particular information needs, thereby helping users to gain knowledge. This talk will give an overview of recent work in both of the aforementioned areas. This includes 1) research on mining structured knowledge graphs of factual knowledge, claims and opinions from heterogeneous Web documents as well as 2) recent work in the field of interactive information retrieval, where supervised models are trained to predict the knowledge (gain) of users during Web search sessions in order to personalise rankings. Both streams of research are converging as part of online platforms and applications to facilitate access to data(sets), information and knowledge.
Using AI to understand everyday learning on the Web (Stefan Dietze)
1) The document discusses using artificial intelligence to understand informal learning that occurs on the web through people's everyday activities like searching online.
2) It describes several research projects aimed at detecting learning behaviors and predicting users' knowledge gains from analyzing patterns in their search histories, browsing activities, and other online traces.
3) The goal is to develop models that support learners in efficiently finding reliable information online and gauging their "learning to learn" skills, and applying these to specific online platforms commonly used for daily learning.
Analysing User Knowledge, Competence and Learning during Online Activities (Stefan Dietze)
Research talk given at Italian National Research Council (CNR), Institute for Educational Technologies (ITD) on learning analytics in everyday online activities.
Analysing & Improving Learning Resources Markup on the Web (Stefan Dietze)
Talk at WWW2017 on LRMI adoption, quality and usage. Full paper here: http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/companion/p283.pdf.
Towards embedded Markup of Learning Resources on the Web (Stefan Dietze)
This document analyzes the usage of terms from the Learning Resources Metadata Initiative (LRMI) embedded in web pages. It finds that from 2013 to 2014 there was a significant growth in LRMI adoption, with more distinct classes used but fewer overall documents. The most common learning resource types were worksheets and games. Several errors were also observed in LRMI statements, such as capitalization issues and undefined properties. The analysis is limited to a subset of web pages marked up as creative works, and ongoing work aims to analyze the full subset to further understand how LRMI is being used on the web.
Linked Data for Architecture, Engineering and Construction (AEC) (Stefan Dietze)
The document discusses the relationship between building information modeling (BIM) and the semantic web. It provides an introduction to linked data and describes how semantic web technologies can be used to add contextual and background knowledge to BIM data, such as geographical, historical, and statistical information. It also addresses challenges around preserving and maintaining the evolution of linked BIM and architecture data on the semantic web.
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat... (Stefan Dietze)
Presentation from mentoring event of Open Education Europa Challenge (http://www.openeducationchallenge.eu/) about using Linked Data in educational applications.
From Data to Knowledge - Profiling & Interlinking Web Datasets (Stefan Dietze)
This document discusses profiling and interlinking web datasets. It describes recent work on entity and dataset interlinking, dataset profiling, and data consistency. It also discusses challenges such as the long tail of linked data datasets that are rarely reused or linked to. The document proposes approaches to dataset profiling through topic extraction and metadata generation. It also discusses methods for computing semantic relatedness between entities and recommending candidate datasets for interlinking.
Semantic Linking & Retrieval for Digital Libraries
1. Backup
Semantic Linking & Retrieval for Digital Libraries
Dr. Stefan Dietze
11.02.2016
Institut für Informatik/Universität Bonn
29/03/16 1Stefan Dietze
2. Overview: research/application context
Information (types)
Bibliographic (meta)data
Research information
Educational (meta)data
Web & social data
Stakeholders
Archival organisations
Digital libraries
Publishers
Resource providers/consumers
Domains
Life Sciences
Computer Science
Learning Analytics
...
Data-centric tasks
Publishing, preservation, annotation, crawling, search, retrieval ...
3. Overview: contents
Introduction & motivation
Publishing, linking and profiling
Publishing & linking (bibliographic) data
Dataset profiling & linking
Retrieval & search
Entity retrieval in large graphs
Embedded (bibliographic) Web data
Entity summarisation from Web markup
Outlook and future directions
4. Overview: contents
Introduction & motivation
Publishing, linking and profiling
Publishing & linking (bibliographic) data
Dataset profiling & linking
Retrieval & search
Entity retrieval in large graphs
Embedded (bibliographic) Web data
Entity summarisation from Web markup
Outlook and future directions
Annotations: knowledge graphs and linked data; beyond LD: embedded semantics
References: [ESWC13, ESWC14], [ISWC15], [WebSci13, SWJ15], [ongoing]
5. Linked Data diversity: example library & scholarly data
Linked Data: W3C standards & de-facto standard for sharing data on the Web (roughly 1000 datasets, 100 bn triples), adopted specifically by library/GLAM sector & life sciences
Strong focus on established knowledge graphs, e.g. Yago, DBpedia, Freebase (still)
Vocabularies/Schemas
BIBO, Bibliographic Ontology
BIRO, Bibliographic Reference Ontology
CITO, Citation Typing Ontology
SPAR vocabularies (incl. CITO, BIRO)
SWRC (Semantic Web Dogfood)
Functional Req. for Bibliographic Records (FRBR)
Nature Publishing Group Ontology
mEducator Educational Resources
....
Datasets
EUROPEANA
British Library
German, French and Spanish national libraries
Nature Publishing Group
Hochschulbibliothekszentrum NRW
Elsevier Scholarly Publications
TED Talks
mEducator Linked Educational Resources
Open Courseware Consortium
LAK Dataset
...
Initiatives
W3C Library Linked Data Incubator Group
Linked Library Data group on DataHub
LinkedUniversities.org
LinkedEducation.org
W3C Linked Open Education Community Group
...
7. Data publishing, linking and profiling: LinkedUp
Dataset Catalog/Registry
http://data.linkededucation.org/linkedup/catalog/
LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning solutions)
LinkedUp Catalog: largest collection of LD/Open Data for educationally relevant resources (approx. 50 Datasets)
Original datasets published with key content providers, automatically extracted metadata
8. mEducator: medical educational resources
EC-funded eContentPlus project (2009-2012)
Exploratory search through semantic and clustering techniques
Lifting/enriching/clustering medical metadata
Common vocabularies (MeSH, SNOMED, BioPortal etc.)
mEducator dataset: first Linked Data corpus of enriched OER metadata, used by a number of applications
Dietze, S., Kaldoudi, E., Dovrolis, E., Giordano, D., Spampinato, C., Hendrix, M., Protopsaltis, A., Taibi, D., Yu, H. Q. (2013), Socio-semantic Integration of Educational Resources – the Case of the mEducator Project, in Journal of Universal Computer Science (J.UCS), Vol. 19, No. 11, pp. 1543-1569.
Dietze, S., Taibi, D., Yu, H. Q., Dovrolis, N., A Linked Dataset of Medical Educational Resources, British Journal of Educational Technology (BJET), Volume 46, Issue 5, pages 1123–1129, September 2015.
9. LAK Dataset: facilitating scientometrics
Concept              Type                 #
Reference            npg:Citation         7885
Author               foaf:Person          1214
Conference Paper     swrc:InProceedings   652
Organization         foaf:Organization    365
Journal Paper        bibo:Article         45
Proceedings Volume   swrc:Proceedings     15
Journal Volume       bibo:Journal         9
Linked Data corpus of "Learning Analytics" publications of the last 5 years (~ 800 publications)
Metadata, full-text & automated linking (DBLP, SWDF, DBpedia)
Wide adoption (http://lak.linkededucation.org)
1. Data extraction & vocabulary definition
2. Entity co-reference resolution & linking
3. Applications & analysis
Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset, Dietze, S., Taibi, D., D’Aquin, M., Semantic Web Journal, 2015.
10. LinkedUp Catalog in a nutshell
Dataset index & registry, federated search
"Federated queries" through schema mappings
Dataset accessibility
Linking & topic profiling
Schema/Types
11. Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
po:Programme
yov:Video
?
bibo:Book
Schema analysis & mapping
12. typeX
typeX
Co-occurrence after mapping (201 frequently occurring types, mapped into 79 types)
bibo:Film
bibo:Document
po:Programme
bibo:Book
foaf:Document
yov:Video
typeX
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
Schema analysis & mapping
14. contains
yov:Video
<yo:Video …>
<dc:title> Lecture 29 –
Stem Cells </dc:title>
…
</yo:Video…>
Yovisto Video
db:Medicine
db:Rudolf Virchow
db:Cell Biology
Linking entities/datasets through a combination of (i) a „semantic (graph-based) connectivity score (SCS)“ (based on Katz centrality) and (ii) a „co-occurrence-based measure (CBM)“ (similar to Normalised Google Distance)
Evaluation: outperforming Explicit Semantic Analysis (ESA)
SCS = 0.32
CBM = 0.24
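The two measures named above can be illustrated with a minimal sketch. It uses the textbook truncated Katz index (damped counts of walks between two nodes) and the standard Normalised Google Distance formula; it is not the exact SCS/CBM definition from the cited paper. The function names, the damping factor `beta`, the maximum walk length and the toy inputs are assumptions made for illustration.

```python
import math


def katz_connectivity(edges, source, target, beta=0.5, max_length=4):
    """Truncated Katz index: sum over l of beta**l * (number of walks of length l)."""
    neighbours = {}
    for a, b in edges:  # treat the graph as undirected
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    walks = {source: 1}  # number of walks of the current length ending at each node
    score = 0.0
    for length in range(1, max_length + 1):
        nxt = {}
        for node, count in walks.items():
            for neighbour in neighbours.get(node, ()):
                nxt[neighbour] = nxt.get(neighbour, 0) + count
        walks = nxt
        score += (beta ** length) * walks.get(target, 0)
    return score


def normalised_cooccurrence_distance(f_x, f_y, f_xy, n):
    """Normalised Google Distance from occurrence counts; 0 means always co-occurring."""
    if f_xy == 0:
        return math.inf  # the two terms never co-occur
    log_x, log_y = math.log(f_x), math.log(f_y)
    return (max(log_x, log_y) - math.log(f_xy)) / (math.log(n) - min(log_x, log_y))
```

A higher Katz value indicates stronger graph connectivity, whereas a lower distance indicates stronger co-occurrence, so the two need to be brought onto a common scale before they are combined.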
Data(set) interlinking
bibo:Book
British Library Book
<bibo:Book …>
<bibo:title>Über den Hungertyphus</.>
<bibo:creator>Rudolf Virchov</…>
</bibo:Book…>
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
?
db:Cell (Biology)
db:Cell (Microprocessor)
15. db:Biology
db:Cell biology
Dataset
Catalog/Registry
yov:Video
<yo:Video …>
<dc:title>Lecture 29 –
Stem Cells</dc:title>
…
</yo:Video…>
Yovisto Video
Extraction of representative (DBpedia) categories („topic profile“) for arbitrary datasets
Technically trivial, but scalability issues: LOD Cloud 1000+ datasets with <100 billion RDF statements
Efficient approach: sampling & ranking for a balance between scalability and precision/recall
Scalable profiling of datasets
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
db:Cell
(Biology)
16. Efficient dataset profiling
1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling)
2. Entity & topic extraction (NER via DBpedia Spotlight, category mapping & expansion)
3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov)
Result: weighted dataset-topic profile graph
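The ranking step can be sketched with a small power-iteration version of PageRank with priors, where the walk jumps back to a prior distribution (here: the dataset node) with probability `beta`. This is a generic sketch under stated assumptions, not the implementation from the cited paper; the function name, the edge list format and the default parameters are hypothetical.

```python
def pagerank_with_priors(edges, priors, beta=0.15, iterations=50):
    """Rank nodes of a directed graph, biased towards the nodes listed in `priors`."""
    nodes = sorted({n for edge in edges for n in edge} | set(priors))
    total = sum(priors.values())
    prior = {n: priors.get(n, 0.0) / total for n in nodes}  # normalised prior
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: beta * prior[n] for n in nodes}  # jump back to the priors
        for n in nodes:
            if out[n]:
                share = (1 - beta) * rank[n] / len(out[n])
                for m in out[n]:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the prior distribution.
                for m in nodes:
                    nxt[m] += (1 - beta) * rank[n] * prior[m]
        rank = nxt
    return rank
```

With a dataset node as the only prior, the resulting scores for topic nodes can be read as edge weights of a dataset-topic profile.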
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
17. Search & exploration of datasets through topic profiles
Applied to entire LOD cloud/graph
Visual exploration of extracted RDF dataset profiles
(datasets, topics, relationships)
Evaluation results: K-Step Markov (10% sampling size)
outperforms baselines (LDA, tf/idf on entire datasets)
http://data-observatory.org/lod-profiles/
18. Search: entity retrieval on large structured datasets?
Challenges
How to efficiently retrieve related entities/resources for a given query?
Explicit entity links (owl:sameAs etc.) are sparse yet important to facilitate state-of-the-art methods
(e.g. BM25F, Blanco et al., ISWC2011)
Query type affinity?
??
Large dataset/crawl
e.g. LinkedUp dataset graph, LIVIVO dataset, BTC2014
entities related to <James D. Watson>
?
BTC2014
19. Entity retrieval: approach
(I) Offline processing (clustering to address link sparsity)
1. Feature vectors (lexical and structural features)
2. Bucketing: per type (LSH algorithm)
3. Clustering: X-means & Spectral clustering per bucket
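The bucketing step can be sketched with MinHash-based locality-sensitive hashing, which places entities of the same type with similar token sets into shared buckets so that clustering only has to run per bucket. The slides name LSH but not the hash family, so MinHash, the banding scheme, and all function and parameter names below are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def _hash(token, seed):
    """Deterministic seeded hash of a token."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(tokens, num_hashes=8):
    """One minimum per seeded hash function; `tokens` must be non-empty."""
    return tuple(min(_hash(t, seed) for t in tokens) for seed in range(num_hashes))


def lsh_buckets(entities, num_hashes=8, band_size=2):
    """Map (type, band position, band values) to the entity ids that share it.

    `entities` maps an entity id to a pair (entity_type, set_of_tokens).
    """
    buckets = defaultdict(set)
    for entity_id, (entity_type, tokens) in entities.items():
        signature = minhash_signature(tokens, num_hashes)
        for start in range(0, num_hashes, band_size):
            band = signature[start:start + band_size]
            buckets[(entity_type, start, band)].add(entity_id)
    return buckets
```

Entities that share at least one bucket become candidates for the same cluster; including the type in the bucket key keeps the bucketing per type.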
Improving Entity Retrieval on Structured Data,
Fetahu, B., Gadiraju, U., Dietze, S., 14th International
Semantic Web Conference (ISWC2015), Bethlehem,
US, (2015).
(II) Online processing (retrieval)
1. Retrieval & expansion:
a) BM25F results
b) expansion from clusters (related entities)
2. Re-Ranking
(context terms & query type affinity)
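The online steps can be sketched as follows: take the top BM25F results, add the cluster mates of those results, then re-rank all candidates. The BM25F scores are assumed to be precomputed, the re-ranking uses only a simple query-term overlap, and query type affinity is not modelled. All names, the `boost` weight and the scoring formula are hypothetical and not taken from the cited paper.

```python
def expand_and_rerank(bm25f_scores, cluster_of, query_terms, entity_terms,
                      top_k=2, boost=1.0):
    """Return candidate entity ids, best first.

    bm25f_scores: entity id -> precomputed BM25F score
    cluster_of:   entity id -> cluster id from the offline step
    entity_terms: entity id -> iterable of descriptive terms
    """
    # 1a) base results from the BM25F ranking
    base = sorted(bm25f_scores, key=bm25f_scores.get, reverse=True)[:top_k]
    # 1b) expansion with entities from the same clusters
    base_clusters = {cluster_of[e] for e in base if e in cluster_of}
    candidates = set(base)
    candidates.update(e for e, c in cluster_of.items() if c in base_clusters)
    # 2) re-ranking by BM25F score plus overlap with the query terms
    query = {t.lower() for t in query_terms}

    def score(entity):
        terms = {t.lower() for t in entity_terms.get(entity, ())}
        overlap = len(query & terms) / len(query) if query else 0.0
        return bm25f_scores.get(entity, 0.0) + boost * overlap

    return sorted(candidates, key=lambda e: (-score(e), e))
```

In this sketch an entity that BM25F did not rank highly can still surface through its cluster, which is the intended remedy for sparse explicit links.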
20. Dataset
BTC2014 (1.4 billion triples)
92 SemSearch queries
Methods
Our approaches: XM: X-means, SP: Spectral
Baselines: B: BM25F, S1: Tonon et al. [SIGIR12]
Conclusions
XM & SP outperform baselines
Clustering to remedy link sparsity
Relevance to query crucial
Improving Entity Retrieval on Structured Data,
Fetahu, B., Gadiraju, U., Dietze, S., 14th International
Semantic Web Conference (ISWC2015), Bethlehem,
US, (2015).
Entity retrieval: evaluation
21. Introduction & motivation
Publishing, linking and profiling
Publishing & linking (bibliographic) data
Dataset profiling & linking
Retrieval & search
Entity retrieval in large graphs
Embedded (bibliographic) Web data
Entity summarisation from Web markup
Outlook and future directions
Overview: contents so far
[ESWC13, ESWC14]
[ISWC15]
[WebSci13, SWJ15]
Outcomes & impact?
22. Tangible outcomes / impact
Open Datasets
Applications
Vocabularies & Schemas
Initiatives & Working Groups
VOL
+ vocabularies for educational resource & service modeling
W3C Community Group
„Open Linked Education“
DCMI Task Force on LRMI
W3C Schema Bib Extend Group
Tutorial & workshop series on
Linked Data & Learning
LinkedUniversities, LinkedEducation.org
KEYSTONE WG „Search and Profiling of LD“
….
http://linkeduniversities.org
23. Introduction & motivation
Publishing, linking and profiling
Publishing & linking (bibliographic) data
Dataset profiling & linking
Retrieval & search
Entity retrieval in large graphs
Embedded (bibliographic) Web data
Entity summarisation from Web markup
Outlook and future directions
Overview: contents
beyond LD: embedded semantics
Information (types)
Bibliographic (meta)data
Research information
Educational (meta)data
Web & social data
Stakeholders
Archival organisations
Digital libraries
Publishers
....
Domains
Life Sciences
Computer Science
Learning Analytics
...
Data-centric tasks
Publishing, preservation, annotation, crawling, search, retrieval ...
24. The Web: approx. 46.000.000.000.000 (46 trillion) Web pages indexed
by Google
vs
Linked Data: approx. 1000 datasets & 100 billion statements
- different order of magnitude wrt scale & dynamics
Other „semantics“ (structured facts) on the Web?
The Web as a knowledge base: semantics on the Web?
25. Embedded markup (RDFa, Microdata, Microformats) for
interpretation of Web documents (search, retrieval)
Arbitrary vocabularies; schema.org used at scale:
(700 classes, 1000 predicates)
Adoption on the Web: 26 %
(2014 Google study of 12 bn Web pages)
“Web Data Commons” (Meusel & Paulheim [ISWC2014])
• Markup from Common Crawl (2.2 billion pages):
17 billion RDF quads
• Markup in 26% of pages, 14% of PLDs in 2013
(increase from 6% in 2011)
Same order of magnitude as “the Web”
Embedded semantics: Web page markup & schema.org
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
...
</div>
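To show how such markup turns into statements, here is a minimal sketch using Python's standard `html.parser`. It handles only a single flat `itemscope` with text-valued `itemprop` elements; the class name and the fixed subject label `node1` are assumptions, and real extractors such as those behind Web Data Commons cover nested items, `itemid`, `content` attributes and much more.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collect (subject, predicate, value) statements from one flat Microdata item."""

    def __init__(self):
        super().__init__()
        self.current_prop = None  # itemprop of the element being read, if any
        self.statements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.statements.append(("node1", "type", attrs.get("itemtype")))
        if "itemprop" in attrs:
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.current_prop and text:
            self.statements.append(("node1", self.current_prop, text))
            self.current_prop = None

    def handle_endtag(self, tag):
        self.current_prop = None


SNIPPET = """
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
</div>
"""

extractor = MicrodataExtractor()
extractor.feed(SNIPPET)
statements = extractor.statements
```

Each tuple in `statements` corresponds to one row of the kind listed under "RDF statements".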
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
26.
Characteristics: Example
Coreferences: 18.000 results for <„Iphone 6“, type, s:Product> (8,6 quads on average)
Redundancy: <s, schema:name, „Iphone 6“> occurring 1000 times in WDC2013
Lack of links: largely unlinked entity descriptions / subgraphs
Errors (typos & schema violations, see Meusel et al. [ESWC2015]):
• Wrong namespaces, such as http://schma.org
• Undefined types & predicates: 9,7 % in WDC, less common than in LOD
• Confusion of datatype and object properties: <s1, s:publisher, „Springer“>, 24,35 % object property issues vs 8 % in LOD
• Data property range violations: e.g. literals vs numbers (12,6 % in WDC vs 4,6 % in LOD)
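One of the error classes above, misspelled namespaces, lends itself to a simple repair heuristic: compare the host of a URI against a list of known vocabulary hosts and rewrite near misses. The sketch below uses `difflib` from the Python standard library; it is an illustration only and not the heuristics of the cited Meusel & Paulheim paper. The function name, the host list and the similarity cutoff are assumptions.

```python
from difflib import get_close_matches
from urllib.parse import urlsplit, urlunsplit

KNOWN_HOSTS = ["schema.org"]  # hypothetical whitelist of vocabulary hosts


def normalise_namespace(uri, known_hosts=KNOWN_HOSTS, cutoff=0.8):
    """Rewrite a URI whose host is a near miss of a known vocabulary host."""
    parts = urlsplit(uri)
    host = parts.netloc.lower()
    match = get_close_matches(host, known_hosts, n=1, cutoff=cutoff)
    if not match or match[0] == host:
        return uri  # unknown or already correct: leave untouched
    return urlunsplit((parts.scheme, match[0], parts.path, parts.query, parts.fragment))
```

A high cutoff keeps unrelated hosts untouched, at the cost of missing heavier misspellings.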
Using markup as global knowledge base - state of the art
Glimmer (http://glimmer.research.yahoo.com):
entity retrieval (BM25F) on WDC dataset
[Blanco, Mika & Vigna, ISWC2011]
Challenges: specific characteristics of markup data
27. Goal: obtaining an entity summary (or entity-centric knowledge graph) for a given query?
Tasks: document annotation, knowledge base augmentation, semantic enrichments
Using markup as global knowledge base/graph?
Web page
markup
Query
Nucleic Acids, type:(Article)
Entity Summary/Graph
Name
Molecular structure of nucleic
acids
author
James D. Watson
Francis Crick
publisher Nature
datePublished 1953
Web crawls, WDC or large (domain-specific) crawls:
e.g. publishers, universities, libraries etc
28. Candidate Facts
node1 name
Molecular structure
of nucleic acids
node1 author James D. Watson
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953
node2 name Francis Crick
node2 name Cricks
Extract (domain-specific) knowledge bases and knowledge graphs for digital libraries
Experiments on WDC data: 87,6 % MAP, coverage: on average 57% additional facts compared to DBpedia
Ongoing work: entity summarisation from markup data
Query
Nucleic Acids, type:(Article) 1. Retrieval
2. Fact selection
Entity Summary/Graph
Name
Molecular structure of nucleic
acids
author
James D. Watson
Francis Crick
publisher Nature
datePublished 1953
New Queries
James D. Watson, type:(Person)
Francis Crick, type:(Person)
Nature, type:(Organization)
Web crawls, WDC or large (domain-specific) crawls:
e.g. publishers, universities, libraries etc
Web page
markup
(clustering, heuristics, trained classifier)
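As a baseline for the fact selection step, conflicting candidate facts can be resolved by frequency: for predicates that are expected to hold one value, keep the value asserted most often across the crawl. This is a simple voting heuristic for illustration, not the clustering or trained classifier mentioned above. The function name, the list of single-valued predicates and the repeated facts in the usage example are hypothetical.

```python
from collections import Counter, defaultdict


def select_facts(candidates, single_valued=("name", "publisher", "datePublished")):
    """Pick facts for one entity from (predicate, value) candidates by vote count."""
    votes = defaultdict(Counter)
    for predicate, value in candidates:
        votes[predicate][value] += 1
    summary = {}
    for predicate, counts in votes.items():
        ranked = [value for value, _ in counts.most_common()]
        # Single-valued predicates keep only the most frequent value.
        summary[predicate] = ranked[:1] if predicate in single_valued else ranked
    return summary


# Illustrative candidates; the repetition counts are made up for the example.
candidates = [
    ("datePublished", "1953"),
    ("datePublished", "1956"),
    ("datePublished", "1953"),
    ("author", "James D. Watson"),
    ("author", "Francis Crick"),
    ("publisher", "Nature"),
]
summary = select_facts(candidates)
```

Here `summary["datePublished"]` keeps only the value that was asserted twice, while both authors survive because `author` is treated as multi-valued.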
29. [Figure: # entities and # statements per PLD (ranked); y-axis: count (log)]
Unprecedented source of bibliographic data
Metadata about scholarly articles
(s:ScholarlyArticle): 6.793.764 quads, 1.184.623
entities, 429 distinct predicates (in WDC / 1 type
alone)
Top domains: Springer, MDPI, BMJ, diabetesjournals.org, mendeley.com, Biodiversitylibrary.org
Domains, topics, disciplines?
Life Sciences and Computer Science predominant
Top-10 article titles
Most important publishers/journals, libraries
represented
=> Domain-specific & targeted crawls
= unprecedented source of data
Embedded data for digital libraries / life sciences?
30. Knowledge graphs and LD
(Yago, Freebase, Pubmed, DBLP etc)
Entity
node1 name
Molecular structure of
nucleic acids
node1 author James D. Watson
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953
Future work: improving entity-centric tasks for digital libraries
Entity
node2 name Francis Crick
node2 name Cricks
node2 born 1916
• Web data as knowledge resource
• Background knowledge/structured data
• Training data & ground truths
• ....
Embedded
data
Unstructured (Web)
documents
Linked Data
Improving data-centric tasks for large
(bibliographic/life sciences) corpora, eg LIVIVO
• KB construction & augmentation
• Document annotation
• Entity recognition, disambiguation, interlinking
• Search & retrieval ...
32. References (presented work)
Dietze, S., Taibi, D., D’Aquin, M., Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset,
Semantic Web Journal, 2016.
Dietze, S., Kaldoudi, E., Dovrolis, E., Giordano, D., Spampinato, C., Hendrix, M., Protopsaltis, A., Taibi, D., Yu, H. Q. (2013), Socio-
semantic Integration of Educational Resources – the Case of the mEducator Project, in Journal of Universal Computer Science (J.UCS),
Vol. 19, No. 11, pp. 1543-1569.
Dietze, S., Taibi, D., Yu, H. Q., Dovrolis, N., A Linked Dataset of Medical Educational Resources, British Journal of Educational
Technology (BJET), Volume 46, Issue 5, pages 1123–1129, September 2015.
Gadiraju, U., Demartini, G., Kawase, R., Dietze, S. Human beyond the Machine: Challenges and Opportunities of Microtask
Crowdsourcing. In: IEEE Intelligent Systems, Volume 30 Issue 4 – Jul/Aug 2015.
Gadiraju, U., Kawase, R., Dietze, S, Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online
Surveys. ACM CHI Conference on Human Factors in Computing Systems (CHI2015), April 18-23, Seoul, Korea.
Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference
(ISWC2015), Bethlehem, US, (2015).
Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic
Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris,
France, May 2013.
Nunes, B. P., Dietze, S., Casanova, M.A., Kawase, R., Fetahu, B., Nejdl, W., Combining a co-occurrence-based and a semantic measure
for entity linking, in: The Semantic Web: Semantics and Big Data, Proceedings of the 10th Extended Semantic Web Conference
(ESWC2013), Lecture Notes in Computer Science Vol. 7882, Springer Berlin Heidelberg, 2013.
http://www.stefandietze.net
33. Selected related work
Entity retrieval
Alberto Tonon, Gianluca Demartini, and Philippe Cudré-Mauroux. Combining Inverted Indices and Structured
Search for Ad-hoc Object Retrieval. In: 35th Annual ACM SIGIR Conference (SIGIR 2012), Portland, Oregon,
USA, August 2012.
Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data. International
Semantic Web Conference (ISWC) 2011, pages 83-97.
Embedded markups & Web Data Commons
Robert Meusel, Petar Petrovski, Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat
Dataset Series. Proceedings of the 13th International Semantic Web Conference (ISWC 2014), RBDS Track,
Trentino, Italy, October 2014.
Robert Meusel and Heiko Paulheim: Heuristics for Fixing Common Errors in Deployed schema.org Microdata.
Proceedings of the 12th Extended Semantic Web Conference (ESWC 2015), Portoroz, Slovenia, May 2015
Linked Data quality
Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, SPARQL Web-Querying
Infrastructure: Ready for Action?, International Semantic Web Conference 2013, (ISWC2013).
Paulheim H., Bizer, C., Type Inference on Noisy RDF Data, Semantic Web – ISWC 2013, Lecture Notes in
Computer Science Volume 8218, 2013, pp 510-525
Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker., S., An empirical survey of Linked Data
conformance. Journal of Web Semantics 14, 2012