Presentation for the W3C Semantic Web Health Care and Life Sciences Interest Group by Slava Tykhonov, DANS-KNAW (the Royal Netherlands Academy of Arts and Sciences), October 2020. The recording is available at https://www.youtube.com/watch?v=G9oiyNM_RHc
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
This presentation is about support for external controlled vocabularies in Dataverse, an open source data repository. Data Archiving and Networked Services (DANS-KNAW) decided to use Dataverse as the basic technology to build Data Stations and provide FAIR data services for various Dutch research communities.
This document discusses the five-year evolution of Dataverse, an open source data repository platform. It began as a tool for collaborative data curation and sharing within research teams. Over time, features such as dataset version control, APIs, and integration with other systems were added. The document outlines challenges around maintenance and sustainability. It also covers efforts to improve Dataverse's interoperability, such as integrating metadata standards and controlled vocabularies, and making datasets FAIR compliant. The goal is to establish Dataverse as a core component of the European Open Science Cloud by improving areas like software quality, integration with tools, and standardization.
Technical integration of data repositories: status and challenges
This document discusses technical integration of data repositories, including:
- Previous integration initiatives focused on metadata integration using OAI-PMH and ResourceSync protocols, as well as aggregators like OpenAIRE.
- Challenges to integration include different levels of software/service maturity, maintenance of distributed applications, and use of common standards and vocabularies.
- Potential integration efforts could focus on improving FAIRness, metadata/data flexibility, and connections between repositories, software, and computing resources to better enable reuse of EOSC data and services.
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
This document summarizes a presentation about automating CI/CD testing, installation, and deployment of Dataverse in the European Open Science Cloud. It discusses using Docker and Kubernetes for deployment, a community-driven QA plan using pyDataverse for test automation, and providing quality assurance as a service. The presentation also covers topics like the CESSDA maturity model, integrating Dataverse on Google Cloud, and using serverless computing for some Dataverse applications and services.
Ontologies, controlled vocabularies and Dataverse
Presentation on Semantic Web technologies for the Dataverse Metadata Working Group run by the Institute for Quantitative Social Science (IQSS) of Harvard University.
External controlled vocabularies support in Dataverse
This presentation discusses adding support for external controlled vocabularies to the Dataverse data repository platform. It describes how ontologies like SKOS can be used to represent vocabularies and allow linking metadata fields in Dataverse to terms. The presentation proposes developing a Semantic Gateway plugin for Dataverse that would allow browsing and linking to external vocabularies hosted in the SKOSMOS framework via its API. This could improve metadata by allowing standardized, linked terms and help make data more FAIR.
Controlled vocabularies and ontologies in Dataverse data repository
This document discusses supporting external controlled vocabularies in Dataverse. It proposes implementing a JavaScript interface to allow linking metadata fields to terms from external vocabularies accessed via SKOSMOS APIs. Several challenges are identified, such as applying support to any field, backward compatibility, and ensuring vocabularies come from authoritative sources. Caching concepts and linking dataset files directly to terms are also proposed to improve interoperability.
Dataverse can be deployed using Docker containers to improve maintainability and portability. The document discusses how Docker can isolate applications and their dependencies into portable containers. It provides an example of deploying Dataverse as a set of microservices within Docker containers. Instructions are included on building Docker images, running containers, and managing the containers and images through commands and tools like Docker Desktop, Docker Hub, and Docker Compose.
CLARIN CMDI use case and flexible metadata schemes
Presentation for the CLARIAH IG Linked Open Data on the latest developments for the Dataverse FAIR data repository. Building the SEMAF workflow with external controlled vocabulary support and a Semantic API. Using the theory of inventive problem solving (TRIZ) for further innovation in Linked Data.
Building COVID-19 Museum as Open Science Project
This document discusses building a COVID-19 Museum as an open science project. It describes the speaker's background working on various data management projects. It discusses moving towards open science and sharing data according to FAIR principles. It outlines the Time Machine project for digitizing historical documents and its approach to data management. The rest of the document discusses using the Dataverse platform to build repositories, linking metadata to ontologies, using tools like Weblate for translations, and exploring the use of artificial intelligence and machine learning to enhance metadata and facilitate human-in-the-loop review processes.
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
Presentation at ISKO Knowledge Organisation Research Observatory. RESEARCH REPOSITORIES AND DATAVERSE: NEGOTIATING METADATA, VOCABULARIES AND DOMAIN NEEDS
Building collaborative Machine Learning platform for Dataverse network. Lecture by Slava Tykhonov (DANS-KNAW, the Netherlands), DANS seminar series, 29.03.2022
Running Dataverse repository in the European Open Science Cloud (EOSC)
The document discusses Dataverse, an open source data repository software. It summarizes that Dataverse was developed by Harvard University, has a large community and development team, and is used by many countries as a data repository infrastructure. It then describes the SSHOC Dataverse project which aims to create a multilingual, standardized, and reusable open data infrastructure across several European countries. Finally, it notes that Dataverse is a reliable cloud service that enables FAIR data sharing and can be easily deployed by research organizations.
Presentation for CLARIAH IG Linked Open Data on the latest developments for Dataverse FAIR data repository. Building SEMAF workflow with external controlled vocabularies support and Semantic API.
Flexible metadata schemes for research data repositories - Clarin Conference... - Vyacheslav Tykhonov
The development of the Common Framework in Dataverse and the CMDI use case. Building AI/ML based workflow for the prediction and linking concepts from external controlled vocabularies to the CMDI metadata values.
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the... - Andrea Scharnhorst
Presentation given at ISKO UK: research observatory, November 24, 2021
RESEARCH REPOSITORIES AND DATAVERSE: NEGOTIATING METADATA, VOCABULARIES AND DOMAIN NEEDS
Vyacheslav Tykhonov, Jerry de Vries, Eko Indarto, Femmy Admiraal, Mike Priddy, and Andrea Scharnhorst: Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the DANS EASY Research Data Repository
Abstract:
The development of metadata schemes in data repositories (and other content providers) has always been a process of negotiation between the needs of the designated user communities and the content of the collection on the one side, and standards developed in the field on the other. Automatisation has both enabled and enforced standardisation and alignment of metadata schemes. But, while designated user communities turned from being local users to global ones (due to web services), their specific needs have not vanished. Technology offers possibilities to give the aforementioned negotiation a new form. In this presentation, we present the Dataverse platform, used by many data repositories. We show - using the case of the CMDI metadata and the CLARIN (Common Language Resources and Technology Infrastructure) community - how the Dataverse common core set of metadata, called the Citation Block, can be extended with custom fields defined as a discipline-specific metadata block. In particular, we show how these custom fields can be connected to a distributed network of authoritative controlled vocabularies, so that in the end semantic search is possible. The presentation highlights opportunities and challenges, based on our own experiences. Related work has been presented at the CLARIN Annual Conference 2021 (see Proceedings).
SSHOC Dataverse in the European Open Science Cloud
This project summary covers the SSHOC project which aims to create a social sciences and humanities section of the European Open Science Cloud by maximizing data reuse through open science principles. The project will interconnect existing and new infrastructures through a clustered cloud, establish governance for SSH-EOSC, and provide a research data repository service for SSH institutions through further developing the Dataverse platform on EOSC. The project involves 47 partners across 20 beneficiaries and 27 linked third parties with a budget of €14,455,594.08 over 40 months to achieve these objectives.
This webinar in the course of the LOD2 webinar series will present use cases and live demos of D2R (Free University Berlin) and Sparqlify (University of Leipzig).
D2R Server is a tool for publishing relational databases on the Semantic Web. It enables RDF and HTML browsers to navigate the content of the database, and allows applications to query the database using the SPARQL query language.
Sparqlify is a tool enabling one to define expressive RDF views on relational databases and query them with a subset of the SPARQL query language. By featuring a novel RDF view definition syntax, it aims at simplifying the RDB-RDF mapping process.
more to be found at:
The document summarizes key aspects of the gCube architecture and its approaches to enabling interoperability across heterogeneous resources and communities. gCube aims to hide heterogeneity by providing unified access to diverse resources, and embrace heterogeneity by supporting multiple protocols and models. It utilizes various approaches like blackboard communication, wrappers, and adapters to achieve interoperability for resource discovery, content access, data discovery, process execution, and security. Applications like AquaMaps and time series management are also discussed.
D4Science scientific data infrastructure promoting interoperability by embrac... - FAO
D4Science aims to promote interoperability across different data sources and computing platforms by embracing heterogeneity. It provides solutions to abstract over differences in location, protocols, and models, allowing resources to be virtually accessible in a common ecosystem. D4Science offers various tools and patterns to deliver interoperability, connecting heterogeneous digital content, repositories, and computation platforms. This reduces costs and barriers for users while supporting many existing standards.
The document introduces several data repositories for sharing research data including Lincoln Centre for Autonomous Systems (LCAS), Mendeley Data, Nasdaq Data Link, figshare, Harvard Dataverse, Dryad Digital Repository, and Network Repository. It provides a brief description of each repository, highlighting their research areas, data types, file restrictions, and website URLs.
The best way to publish and share research data is with a research data repository. A repository is an online database that allows research data to be preserved across time and helps others find it.
The document presents an artificial intelligence system to analyze medical papers on COVID-19 from the CORD dataset. The system applies natural language processing techniques like named entity recognition using PubMedBERT to extract entities from paper abstracts. It stores the extracted semi-structured entity-relation data in a NoSQL database to enable complex queries and insights. The system also uses knowledge graphs and text-based emotion detection to analyze changes in medical treatment over time and combined treatment strategies mentioned in the papers.
The document summarizes a talk given by Dr. Johannes Keizer on the CIARD (Coherence in Information for Agricultural Research for development) initiative and a global infrastructure for linked open data (LOD). The CIARD initiative aims to provide open access to agricultural research by promoting standards and sharing information. It involves institutions contributing their research outputs through the CIARD RING and adopting standards. The infrastructure proposed includes distributed repositories linked through vocabularies and LOD. Tools are being developed to generate LOD and link datasets through shared concepts.
This is one out of a series of presentations which I have given during a recent trip to the United States. I will make them all public, but content does not vary a lot between some of them
Foundations for the future of science discusses using artificial intelligence and machine learning to advance scientific research. Key points discussed include using AI to analyze large datasets, develop scientific models, and automate experimental workflows. The document also outlines several examples of how the Globus data platform is currently enabling AI-powered scientific applications across multiple domains. Overall, the document advocates that embracing "AI for science" has the potential to accelerate scientific discovery by overcoming limitations in human analysis capabilities and computational resources.
The current status of Linked Open Data (LOD) shows evidence of many datasets available on the Web in RDF. In the meantime, there are still many challenges to overcome by organizations in their journey of publishing five stars datasets on the Web. Those challenges are not only technical, but are also organizational. At this moment where connectionist AI is gaining a wave of popularity with many applications, LOD needs to go beyond the guarantee of FAIR principles. One direction is to build a sustainable LOD ecosystem with FAIR-S principles. In parallel, LOD should serve as a catalyzer for solving societal issues (LOD for Social Good) and personal empowerment through data (Social Linked Data).
Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018) - Joachim Neubert
Wikidata has been used successfully as a linking hub for authority files. Knowledge organization systems like thesauri or classifications are more complex and pose additional challenges.
The document discusses several reputable sources for finding reliable datasets for academic research, including research data repositories like Mendeley Data, Dryad Digital Repository, and Harvard Dataverse. It provides details on the types of data available and fields covered for repositories like LCAS, Nasdaq Data Link, figshare, and Network Repository. The document emphasizes ensuring compliance with terms of use and applicable laws when using datasets from these resources.
This document discusses challenges and opportunities around discovering and using open government data. It notes that simply publishing data as linked data is not enough, and that metadata standards and presentation methods are needed to aid discovery and use. It highlights work done by Tetherless World Constellation to apply metadata standards to describe government datasets and create an aggregated catalog of over 1 million datasets. The use of schema.org and other semantic markup is discussed to enable search engines to more easily parse and index government data catalogs. Federation of catalogs using APIs and standards like DCAT and CKAN is also covered. The document emphasizes that exposing metadata is key to getting government data discovered.
The document discusses how semantic web technologies like ontologies, linked open data, and knowledge graphs can benefit cultural heritage applications. It provides ArCo, an Italian cultural heritage knowledge graph, as an example. ArCo converts Italian cultural property catalog records from XML to RDF, and uses ontologies to represent cultural heritage concepts and logically connect data from different sources. This enriches data and improves interoperability. The document outlines how ArCo can be explored through its SPARQL endpoint, visualizations, and other tools.
The document provides an introduction to Prof. Dr. Sören Auer and his background in knowledge graphs. It discusses his current role as a professor and director focusing on organizing research data using knowledge graphs. It also briefly outlines some of his past roles and major scientific contributions in the areas of technology platforms, funding acquisition, and strategic projects related to knowledge graphs.
The document discusses the evolution and history of the Internet and the Research Data Alliance (RDA). It provides details on:
- How the Internet originated from research networks developed by DARPA in the 1960s-70s.
- The RDA aims to build bridges for open sharing of research data globally by facilitating collaboration between experts. It is supported by funding from the EC, Australian NSD, and US NSF/NIST.
- The RDA works through Working and Interest Groups that develop standards and recommendations to advance data sharing at biannual plenary meetings. Several outputs addressing issues like metadata standards, data type registries, and PID information are expected in 2014.
This document discusses the potential for developing a knowledge network by leveraging metadata from scientific endeavors. It begins by outlining some of the limitations of traditional metadata approaches. It then proposes that metadata could be structured as a graph using semantic triples to represent relationships between people, institutions, projects, and other elements. This liberalized metadata approach could help reduce complexity while providing a more comprehensive view of scientific activities and outputs. The document advocates for establishing common standards, developing tools to extract and aggregate metadata, and creating services and repositories to enable discovery, analysis, and visualization of the knowledge network. The goal is to facilitate research by providing integrated access to information on scientific data, publications, actors and their relationships.
The document discusses challenges with querying decentralized knowledge graphs. It begins by defining data and knowledge, and how combining them provides meaning. It then discusses how public SPARQL endpoints for knowledge graphs like DBpedia and Wikidata often do not allow querying the full graph and returning complete results due to performance and scalability issues. Federated query engines have been developed to query multiple knowledge graphs at once, but still face challenges with complex queries involving property paths or aggregations over large datasets.
Presented at the Northern Ohio Technical Services Librarians' meeting, November 22, 2013. Describes why libraries should move toward a linked data future to enable their resources to be discoverable on the open web, and includes lessons learned from developing the eXtensible Catalog at the University of Rochester.
From data portal to knowledge portal: Leveraging semantic technologies to sup... - Xiaogang (Marshall) Ma
The document discusses the Deep Carbon Virtual Observatory (DCvO), which aims to provide integrated access to scientific data and information related to deep carbon science. It describes the DCvO's architecture, which leverages semantic technologies and ontologies to link different types of data and entities. The DCvO provides visualization tools to discover information by clicking through linked concepts, facilitates potential collaborations, and acts as an integrated portal for diverse scientific content. It also discusses some of the DCvO's boundary activities, like linking datasets to publications and integrating with external data repositories.
Similar to Building COVID-19 Knowledge Graph at CoronaWhy
Decentralised identifiers and knowledge graphs
Building an Operating System for Open Science: data integration challenges, Dataverse data repository and knowledge graphs. Lecture by Slava Tykhonov, DANS-KNAW, for the Journées Scientifiques de Rochebrune 2023 (JSR'23).
Decentralised identifiers for CLARIAH infrastructure
Slides of the presentation for CLARIAH community on the ideas how to make controlled vocabularies sustainable and FAIR (Findable, Accessible, Interoperable, Reusable) with the help of Decentralized Identifiers (DIDs).
Dataverse repository for research data in the COVID-19 Museum
The COVID-19 Museum has the ambition to create a platform to deposit, consult, aggregate and study heterogeneous data about the pandemic using features of a distributed web service. To achieve this purpose, Dataverse has been selected as a reliable FAIR data repository with a built-in search engine and functionality that allows adding computing resources to explore archived resources, both data and metadata. Presentation by Slava Tykhonov, DANS-KNAW (the Royal Netherlands Academy of Arts and Sciences), Université Paris Cité, 19 April 2022.
Flexible metadata schemes for research data repositories - CLARIN Conference'21
The development of the Common Framework in Dataverse and the CMDI use case. Building AI/ML based workflow for the prediction and linking concepts from external controlled vocabularies to the CMDI metadata values.
Data standardization process for social sciences and humanities
This document discusses data standardization processes at DANS-KNAW. It describes how DANS-KNAW standardizes metadata during data deposit and harvesting through controlled vocabularies. It also discusses how DANS is developing the SSHOC DataverseEU project to standardize metadata across several European countries. The document concludes by emphasizing the importance of tracking provenance information and developing standardization pipelines and services to improve data and metadata access.
This document summarizes the development process for the DataverseSSHOC project. It outlines two parallel development tracks - a core development team modifying the Dataverse core functionality, and an application development team creating new tools. Tasks are managed using Trello. Code is stored in GitHub and BitBucket. The development follows a SCRUM process with Docker images available on Docker Hub and a Kubernetes cluster for deployment. Testing includes unit, integration, performance, and A/B testing to comply with CESSDA maturity standards.
Building COVID-19 Knowledge Graph at CoronaWhy
1. CoronaWhy Knowledge Graph
Slava Tykhonov
Senior Information Scientist
(DANS-KNAW, the Netherlands)
W3C Semantic Web Health Care and Life Sciences Interest Group webinar
06.10.2020
5. COVID-19 Open Research Dataset Challenge (CORD-19)
It all started with this (March 2020):
“In response to the COVID-19 pandemic and with the view to boost research, the Allen Institute for AI together with CZI, MSR, Georgetown, NIH & The White House is collecting and making available for free the COVID-19 Open Research Dataset (CORD-19). This resource is updated weekly and contains over 287,000 scholarly articles, including 82,870 with full text, about COVID-19 and other viruses of the coronavirus family.” (Kaggle)
6. CoronaWhy Community Tasks (March-April)
1. Task-Risk helps to identify risk factors that can increase the chance of being infected, or affect the severity or the survival outcome of the infection
2. Task-Ties to explore transmission, incubation and environment stability
3. Match Clinical Trials allows exploration of the results from the COVID-19 International Clinical Trials dataset
4. COVID-19 Literature Visualization helps to explore the data behind the AI-powered literature review
5. Named Entity Recognition across the entire corpus of CORD-19 papers with full text
8. Collaboration with other organizations
● Harvard Medical School, INDRA integration
● Helix Group, Stanford University
● NASA JPL, COVID-19 knowledge graph and GeoParser
● Kaggle, coronamed application
● Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, knowledge graph
● dcyphr, a platform for creating and engaging with distillations of academic articles
● CAMARADES (Collaborative Approach to Meta-Analysis and Review of Animal Data from Experimental Studies)
We’ve got almost endless data streams...
10. Building a horizontal platform to serve vertical teams
Source: CoronaWhy infrastructure introduction
11. Turning FAIR into reality!
DANS-KNAW is one of the worldwide leaders in FAIR data (FAIRsFAIR)
12. Dataverse as data integration point
● Available as a service for the community since April 2020
● Used by CoronaWhy vertical teams for data exchange and sharing
● Intended to help researchers make their data FAIR
● One of the biggest COVID-19 data archives in the world, with 700k files
● New teams get their own data containers and can reuse data collected and produced by others
https://datasets.coronawhy.org
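As an illustration of how a vertical team could pull data from this Dataverse instance programmatically, here is a minimal pyDataverse sketch; the DOI used below is a hypothetical placeholder, and only the NativeApi/DataAccessApi calls follow pyDataverse's documented interface.

```python
# Minimal sketch: fetching dataset metadata and one data file from the
# CoronaWhy Dataverse with pyDataverse. The DOI below is a hypothetical
# placeholder, not a real identifier from the archive.
from pyDataverse.api import NativeApi, DataAccessApi

BASE_URL = "https://datasets.coronawhy.org"
api = NativeApi(BASE_URL)
data_api = DataAccessApi(BASE_URL)

# Retrieve the JSON metadata of a published dataset.
dataset = api.get_dataset("doi:10.5072/FK2/EXAMPLE")  # hypothetical DOI
files = dataset.json()["data"]["latestVersion"]["files"]

# Download the first file of the latest published version.
file_id = files[0]["dataFile"]["id"]
response = data_api.get_datafile(file_id)
with open(files[0]["label"], "wb") as f:
    f.write(response.content)
```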
14. COVID-19 data files verification
We verify every file by importing its contents into a dataframe. All column names (variables) extracted from tabular data are available as labels in the file metadata.
We’ve enabled Dataverse data previewers to browse through the content of files without downloading them!
We’re starting internal challenges to build ML models for metadata classification.
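A minimal sketch of the verification step described above, assuming tabular files that pandas can read; the file name and the way labels would be written back to Dataverse metadata are illustrative only.

```python
# Sketch: load a tabular file into a dataframe and expose its column
# names (variables) as labels for the file metadata. "example.csv" is
# an illustrative placeholder.
import pandas as pd

def extract_variable_labels(path: str) -> list[str]:
    """Parse a tabular file; return its column names as metadata labels."""
    df = pd.read_csv(path)  # raises an error if the file cannot be parsed
    return [str(col) for col in df.columns]

labels = extract_variable_labels("example.csv")
print(labels)  # these labels would then be attached to the file's metadata in Dataverse
```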
15. Dataverse content in Jupyter notebooks
Source: Dataverse examples on Google Colabs
16. COVID-19 Data Crowdsourcing
The CoronaWhy data management team reviews all harvested datasets and tries to identify the important data.
We’re approaching GitHub repository owners by creating issues in their repos and inviting them to help us.
More than 20% of data owners join the CoronaWhy community or are interested in curating their datasets.
Bottom-up data collection works!
17. Challenge of data integration and various ontologies
CORD-19 collection workflows with the NLP pipeline:
● manual annotation and labelling of COVID-19 related papers
● automatic entity extraction and classification of text fragments
● statement extraction and curation
● linking papers to specific research questions with relationship extraction
Dataverse Data Lake streaming COVID-19 datasets from various sources:
● medical data
● socio-economic data
● political data and statistics
18. The importance of standards and ontologies
Generic controlled vocabularies to link metadata in bibliographic collections are well known: ORCID, GRID, GeoNames, Getty.
Medical knowledge graphs are powered by:
● Biological Expression Language (BEL)
● Medical Subject Headings (MeSH®) by the U.S. National Library of Medicine (NIH)
● Wikidata (open ontology) - Wikipedia
Integration based on metadata standards:
● MARC 21, Dublin Core (DC), Data Documentation Initiative (DDI)
19. Statements extraction with INDRA
Source: EMMAA (Ecosystem of Machine-maintained Models with Automated Assembly)
“INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system, originally developed for molecular systems biology and currently being generalized to other domains.”
Developed as part of the Harvard Program in Therapeutic Science and the Laboratory of Systems Pharmacology at Harvard Medical School.
http://indra.bio
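As a rough idea of what statement extraction with INDRA looks like in code, here is a hedged sketch using the TRIPS reader, one of INDRA's documented input sources; the sentence is an arbitrary example, and real pipelines would feed CORD-19 text instead.

```python
# Hedged sketch: extracting INDRA Statements from a single sentence via the
# TRIPS reader (calls a remote reading service). The sentence is only an example.
from indra.sources import trips

tp = trips.process_text("SARS-CoV-2 ORF9b suppresses the innate immune response.")
if tp is not None:
    for stmt in tp.statements:
        # Each Statement is a structured assertion (e.g. Inhibition) with typed agents.
        print(type(stmt).__name__, stmt)
```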
21. Advanced Text Analysis (NLP pipeline)
Source: D.Shlepakov, How to Build Knowledge Graph, notebook
We need a good understanding of all domain-specific texts to make the right statements:
22. Building domain specific knowledge graphs
● We’re collecting all possible COVID-19 data and archiving it in our Dataverse
● Looking for various related controlled vocabularies and ontologies
● Building and reusing conversion pipelines to get all data values linked in RDF format
The ultimate goal is to automate the process of knowledge extraction by using the latest developments in Artificial Intelligence and Deep Learning. We need to organize all available knowledge in a common way.
23. Simple Knowledge Organization System (SKOS)
SKOS models thesaurus-like resources:
- skos:Concepts have preferred labels and alternative labels (synonyms) attached to them (skos:prefLabel, skos:altLabel).
- Concepts can be related with the skos:broader, skos:narrower and skos:related properties.
- Terms and concepts can have more than one broader term or concept.
SKOS makes it possible to create a semantic layer on top of objects: a network of statements and relationships.
A major difference is that SKOS hierarchies are not strict logical “is-a” hierarchies: in thesauri, the hierarchical relation can represent anything from “is-a” to “part-of”.
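To make the SKOS terms above concrete, here is a small rdflib sketch that builds one concept with a preferred label, a synonym and a broader concept; the example.org URIs are invented, not entries of a real vocabulary.

```python
# Minimal SKOS example with rdflib: one concept with prefLabel, altLabel
# and a broader relation. The example.org URIs are illustrative only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/vocab/")
g = Graph()
g.bind("skos", SKOS)

covid = URIRef(EX["covid19"])
disease = URIRef(EX["coronavirus-disease"])

g.add((covid, RDF.type, SKOS.Concept))
g.add((covid, SKOS.prefLabel, Literal("COVID-19", lang="en")))
g.add((covid, SKOS.altLabel, Literal("2019-nCoV disease", lang="en")))
g.add((covid, SKOS.broader, disease))
g.add((disease, RDF.type, SKOS.Concept))
g.add((disease, SKOS.prefLabel, Literal("Coronavirus disease", lang="en")))

print(g.serialize(format="turtle"))
```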
24. RDF graph using the SKOS Core Vocabulary
Source: SKOS Core Guide
25. Global Research Identifier Database (GRID) in SKOS
We already have a lot of data in the CoronaWhy Data Lake.
Can we provide humans with a convenient web interface to create links to data points?
Can we use Machine Learning algorithms to predict links and convert data to SKOS automatically?
26. SKOSMOS framework to discover ontologies
● SKOSMOS is developed in Europe by the National Library of Finland (NLF)
● active global user community
● search and browsing interface for SKOS concepts
● multilingual vocabulary support
● used for different use cases (publishing vocabularies, building discovery systems, vocabulary visualization)
27. Skosmos pipeline
The Skosmos backend is based on Jena Fuseki, a SPARQL server and RDF triple store integrated with the Lucene search engine. The jena-text extension can be used for faster text search and is used by both the Skosmos GUI and the REST API.
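As a sketch of how such a Fuseki backend can be queried directly, the snippet below runs a simple SKOS label lookup through SPARQLWrapper; the endpoint URL and dataset name are assumptions, not the actual CoronaWhy deployment.

```python
# Hedged sketch: querying a Jena Fuseki SPARQL endpoint behind Skosmos for
# concepts whose preferred label contains "covid". The endpoint URL is a
# hypothetical local deployment.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:3030/skosmos/sparql")  # assumed endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label WHERE {
        ?concept a skos:Concept ;
                 skos:prefLabel ?label .
        FILTER(CONTAINS(LCASE(STR(?label)), "covid"))
    } LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["concept"]["value"], row["label"]["value"])
```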
28. MeSH in SKOSMOS
The SKOS version of Medical Subject Headings is provided by the Finnish Thesaurus and Ontology Service (Finto): https://finto.fi/mesh/
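Since Finto publishes MeSH through the Skosmos REST API, terms can be looked up with a plain HTTP search call; the query string below is only an example, and the response fields used follow the documented /rest/v1/search method.

```python
# Sketch: searching the Finto-hosted SKOS version of MeSH through the
# Skosmos REST API. The query term is an arbitrary example.
import requests

resp = requests.get(
    "https://api.finto.fi/rest/v1/search",
    params={"vocab": "mesh", "query": "coronavirus*", "lang": "en"},
    timeout=30,
)
resp.raise_for_status()

for hit in resp.json().get("results", []):
    print(hit.get("prefLabel"), hit.get("uri"))
```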
31. COVID-19 expert questions
Source: Epidemic Questions Answering
“In response to the COVID-19 pandemic, the Epidemic Question Answering (EPIC-QA) track challenges teams to develop systems capable of automatically answering ad-hoc questions about the disease COVID-19, its causal virus SARS-CoV-2, related corona viruses, and the recommended response to the pandemic. While COVID-19 has been an impetus for a large body of emergent scientific research and inquiry, the response to COVID-19 raises questions for consumers.”
34. Visual graph of COVID-19 dataset
Source: CoronaWhy GraphDB
35. COVID-19 questions in datasets metadata
Source: COVID-19 European data hub in Harvard Dataverse
● COVID-19 ontologies can be hosted by the SKOSMOS framework
● Researchers can enrich metadata by adding standardized questions provided by SKOSMOS ontologies
● rich metadata is exported back to the Linked Open Data Cloud to increase the chance of being found
● enriched metadata can be used for further ML model training
38. Do you know a lot of people that can use SPARQL?
Source: Semantic annotation of the Laboratory Chemical Safety Summary in PubChem
39. CLARIAH conclusions
“By developing these decentralised, yet controlled Knowledge Graph development practices we have contributed to increasing interoperability in the humanities and enabling new research opportunities to a wide range of scholars. However, we observe that users without Semantic Web knowledge find these technologies hard to use, and place high value in end-user tools that enable engagement.
Therefore, for the future we emphasise the importance of tools to specifically target the goals of concrete communities – in our case, the analytical and quantitative answering of humanities research questions for humanities scholars. In this sense, usability is not just important in a tool context; in our view, we need to empower users in deciding under what models these tools operate.” (CLARIAH: Enabling Interoperability Between Humanities Disciplines with Ontologies)
Chicken-and-egg problem: users are building tools without data models and ontologies, but in reality they need to build a knowledge graph with common ontologies first!
40. Linked Data integration challenges
● datasets are very heterogeneous and multilingual
● data usually lacks sufficient quality control
● data providers use different modeling schemas and styles
● linked data cleansing and versioning are very difficult to track and maintain properly; web resources aren’t persistent
● even modern data repositories provide only metadata records describing the data, without giving access to individual data items stored in files
● it is difficult to assign and manually keep up-to-date entity relationships in the knowledge graph
CoronaWhy has too many information streams that seem impossible to integrate and give back to COVID-19 researchers. So, do we have a solution?
41. Bibliographic Framework (BIBFRAME) as a Web of Data
“The Library of Congress officially launched its Bibliographic Framework Initiative in May 2011. The Initiative aims to re-envision and, in the long run, implement a new bibliographic environment for libraries that makes "the network" central and makes interconnectedness commonplace.”
“Instead of thousands of catalogers repeatedly describing the same resources, the effort of one cataloger could be shared with many.” (Source)
In 2019 BIBFRAME 2.0, the Library of Congress Pilot, was announced.
Let’s take a journey and move from a domain-specific ontology to a bibliographic one!
42. BIBFRAME 2.0 concepts
● Work. The highest level of abstraction, a Work, in the BIBFRAME context, reflects the conceptual essence of the cataloged resource: authors, languages, and what it is about (subjects).
● Instance. A Work may have one or more individual, material embodiments, for example, a particular published form. These are Instances of the Work. An Instance reflects information such as its publisher, place and date of publication, and format.
● Item. An Item is an actual copy (physical or electronic) of an Instance. It reflects information such as its location (physical or virtual), shelf mark, and barcode.
● Agents: Agents are people, organizations, jurisdictions, etc., associated with a Work or Instance through roles such as author, editor, artist, photographer, composer, illustrator, etc.
● Subjects: A Work might be “about” one or more concepts. Such a concept is said to be a “subject” of the Work. Concepts that may be subjects include topics, places, temporal expressions, events, works, instances, items, agents, etc.
● Events: Occurrences, the recording of which may be the content of a Work.
Source: the Library of Congress
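A minimal rdflib sketch of the Work/Instance split described above; the paper URI and literal values are invented placeholders, and only a handful of properties from the BIBFRAME 2.0 namespace are used.

```python
# Hedged sketch: describing one CORD-19 paper as a BIBFRAME Work with an
# Instance. URIs and literal values are invented placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

BF = Namespace("http://id.loc.gov/ontologies/bibframe/")
EX = Namespace("http://example.org/cord19/")

g = Graph()
g.bind("bf", BF)

work = URIRef(EX["paper-0001-work"])
instance = URIRef(EX["paper-0001-instance"])

g.add((work, RDF.type, BF.Work))
g.add((work, RDFS.label, Literal("Example CORD-19 paper on SARS-CoV-2")))
g.add((work, BF.subject, URIRef("https://id.nlm.nih.gov/mesh/D000086382")))  # MeSH concept for COVID-19
g.add((instance, RDF.type, BF.Instance))
g.add((instance, BF.instanceOf, work))

print(g.serialize(format="turtle"))
```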
43. MARC as a foundation of the structured Data Hub
Source: the Library of Congress, USA
The MARC standard was developed in the 1960s to create records that could be read by computers and shared among libraries. The term MARC is an abbreviation for MAchine Readable Cataloging.
The MARC 21 bibliographic format was created for the international community. It’s very rich, with more than 2,000 data elements defined!
It’s identified by its ISO (International Organization for Standardization) number: ISO 2709
44. How to integrate data in the common KG?
● Use MARC 21 as a basis for all bibliographic and authority records
● All controlled vocabularies should be expressed in the MARC 21 Format for Authority Data; we need to build an authority linking process with a “human in the loop” approach that allows verification of AI-predicted links.
● Different MARC 21 fields can be linked to different ontologies and even interlinked. For example, we can get some entities linked to both MeSH and Wikidata in the same bibliographic record to increase the interoperability of the Knowledge Graph.
● Every CORD-19 paper can get a metadata enrichment provided by any research team working on the NLP extraction of entities, relations, or linking controlled vocabularies together.
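To illustrate the double linking mentioned above (the same entity pointing to both MeSH and Wikidata in one record), here is a library-agnostic sketch that renders an extracted entity as textual MARC 21 650 subject fields with $0 identifiers; the record layout is purely illustrative, although the MeSH and Wikidata URIs shown are the real identifiers for COVID-19.

```python
# Library-agnostic sketch: turning one NLP-extracted entity into MARC 21
# 650 subject fields, each carrying a $0 identifier so the same concept is
# linked to both MeSH and Wikidata. The record layout is illustrative only.
entity = {
    "label": "COVID-19",
    "mesh": "https://id.nlm.nih.gov/mesh/D000086382",
    "wikidata": "http://www.wikidata.org/entity/Q84263196",
}

def to_650_fields(ent: dict) -> list[str]:
    """Render one entity as textual MARC 650 fields with $0 links."""
    return [
        f"650  0 $a {ent['label']} $0 {ent['mesh']}",
        f"650  7 $a {ent['label']} $2 wikidata $0 {ent['wikidata']}",
    ]

for field in to_650_fields(entity):
    print(field)
```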
45. COVID-19 paper in MARC 21 representation
Structure of the bibliographic record:
● authority records contain information about authors and affiliations
● medical entities extracted by the NLP pipeline are interlinked in 650x fields
● part of the metadata fields are generated and filled by Machine Learning models, part are contributed by human experts
● provenance information is kept in 833x fields, indicating fully or partially machine-generated records
● relations between entities are stored in 730x fields
Source: CoronaWhy CORD-19 portal
50. CLARIAH and Network Digital Heritage (NDE) GraphiQL
CoronaWhy is running its own instance of NDE on a Kubernetes cluster and maintains support for other ontologies (MeSH, ICD, NDF) available via their SPARQL endpoints!
If you want to query the API:
curl "http://nde.dataverse-dev.coronawhy.org/nde/graphql?query=%20%7B%20terms(match%3A%22COVID%22%2Cdataset%3A%5B%22wikidata%22%5D)%20%7B%20dataset%20terms%20%7Buri%2C%20altLabel%7D%20%7D%20%7D"
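For readability, the URL-encoded query in the curl call above decodes to a one-line GraphQL document; the same request can be sent from Python as shown below (the endpoint is the one on the slide and may no longer be reachable).

```python
# The GraphQL query from the curl example above, decoded and sent with
# requests. The endpoint is taken from the slide itself.
import requests

QUERY = '{ terms(match: "COVID", dataset: ["wikidata"]) { dataset terms { uri, altLabel } } }'

resp = requests.get(
    "http://nde.dataverse-dev.coronawhy.org/nde/graphql",
    params={"query": QUERY},
    timeout=30,
)
print(resp.json())
```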
51. Increasing Dataverse metadata interoperability
External controlled vocabulary support contributed by the SSHOC project (data infrastructure for the EOSC)
52. Hypothes.is annotations as a peer review service
1. The AI pipeline does domain-specific entity extraction and ranking of relevant CORD-19 papers.
2. Automatic entities and statements will be added; important fragments should be highlighted.
3. Human annotators should verify the results and validate all statements.
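A hedged sketch of how a machine-generated statement could be pushed to Hypothes.is for human review via its public annotation API; the token, target URI and annotation text are placeholders, and only a minimal subset of the documented payload fields is shown.

```python
# Hedged sketch: posting one machine-generated annotation to Hypothes.is so
# a human reviewer can verify it. Token, target URI and text are placeholders.
import requests

API_TOKEN = "YOUR_HYPOTHESIS_API_TOKEN"  # placeholder
annotation = {
    "uri": "https://doi.org/10.1101/2020.01.01.000000",  # placeholder paper URL
    "text": "Auto-extracted statement: drug X inhibits protein Y (please verify).",
    "tags": ["CoronaWhy", "machine-generated"],
}

resp = requests.post(
    "https://api.hypothes.is/api/annotations",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=annotation,
    timeout=30,
)
print(resp.status_code, resp.json())
```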
54. Building an Operating System for Open Science
The CoronaWhy Common Research and Data Infrastructure is distributed and robust enough to be scaled up and reused for other tasks like cancer research.
All services are built from Open Source components.
Data is processed and published in a FAIR way; the provenance information is part of our Data Lake.
Data evaluation and credibility are the top priority; we’re providing tools for the expert community for the verification of our datasets.
The transparency of data and services guarantees the reproducibility of all experiments and brings new insights into COVID-19 research.
55. CoronaWhy Common Research and Data Infrastructure
Data preprocessing pipeline implemented on a Jupyter notebook Docker image with extra modules.
Dataverse as the data repository to store data from automatic and curated workflows.
Elasticsearch has CORD-19 indexes at the section and sentence level with spaCy enrichments. Other indexes: MeSH, GeoNames, GRID.
Hypothes.is and Doccano annotation services to annotate publications.
Virtuoso and GraphDB with public SPARQL endpoints to query the COVID-19 Knowledge Graph.
Other services: Colab, MongoDB, Kibana, BEL Commons 3.0, INDRA, Geoparser, Tabula
https://github.com/CoronaWhy/covid-19-infrastructure