Data discovery through federated dataset catalogs (Valeria Pesce)
The document discusses data discovery through federated dataset catalogs. It notes that there are many institutional and thematic catalogs that need to be searched. Federated metadata catalogs and secondary catalogs can help by allowing searches across multiple primary catalogs. Good metadata practices at both the primary catalog and secondary catalog levels are important for discovery, including using open, interoperable standards and vocabularies. The document examines current practices for agricultural data discovery and identifies opportunities for improving metadata quality and standards to enhance discovery across institutional boundaries.
Semantic challenges in sharing dataset metadata and creating federated dataset catalogs (Valeria Pesce)
This document discusses the semantic challenges involved in sharing metadata about datasets and creating federated catalogs of datasets. It outlines the types of semantics needed to fully describe datasets, including vocabularies for topics, file formats, data structures, and geographic coverage. It provides examples of how different dataset catalogs use semantics inconsistently. The CIARD RING project aims to address these challenges by developing a common metadata application profile and linking dataset metadata to controlled vocabularies and ontologies using Linked Open Data principles. This allows queries across datasets from different sources to leverage the semantic mappings.
Dataset description: DCAT and other vocabularies (Valeria Pesce)
This document discusses metadata needed to describe datasets for applications to find and understand them when stored in data catalogs or repositories. It examines existing dataset description vocabularies like DCAT and their limitations in fully capturing necessary metadata.
Key points made:
- Machine-readable metadata is important for datasets to be discoverable and usable by applications when stored across repositories.
- Metadata should describe the dataset, distributions, dimensions, semantics, protocols/APIs, subsets etc.
- Vocabularies like DCAT provide some metadata but don't fully cover dimensions, semantics, protocols/APIs or subsets.
- No single vocabulary or data catalog solution currently provides all necessary metadata for full semantic interoperability.
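The gap described above can be made concrete with a minimal sketch of a DCAT-style dataset description expressed as a JSON-LD dictionary in Python. All values are hypothetical, and the comments mark which kinds of metadata core DCAT does and does not cover.

```python
# Minimal DCAT-style dataset description as a JSON-LD dict (hypothetical values).
# Core DCAT handles the catalog/dataset/distribution level well; the gaps noted
# above (dimensions, column-level semantics, APIs, subsets) are not covered.
import json

dcat_record = {
    "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                 "dct": "http://purl.org/dc/terms/"},
    "@type": "dcat:Dataset",
    "dct:title": "Example crop-yield dataset",       # discovery metadata: covered
    "dct:description": "Annual yields, 2000-2015",   # covered
    "dcat:keyword": ["agriculture", "yield"],        # covered
    "dcat:distribution": [{                          # access metadata: covered
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "http://example.org/yields.csv",
        "dct:format": "text/csv",
    }],
    # Not expressible in core DCAT: per-column semantics, dataset dimensions,
    # query protocols/APIs, or addressable subsets -- other vocabularies needed.
}

print(json.dumps(dcat_record, indent=2))
```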
How to describe a dataset. Interoperability issues (Valeria Pesce)
Presented by Valeria Pesce during the pre-meeting of the Agricultural Data Interoperability Interest Group (IGAD) of the Research Data Alliance (RDA), held on 21 and 22 September 2015 in Paris at INRA.
This document summarizes a session from the Force 11 Scholarly Communications Institute Summer School on data discovery. The session covered metadata, including what it is, types of metadata, and standards. It discussed how people search for and find data through various sources. The session also explored the FAIR data principles of findable, accessible, interoperable and reusable data and had breakout groups discuss applying these principles in practice.
An introduction to the FAIR principles and a discussion of key issues that must be addressed to ensure data is findable, accessible, interoperable and re-usable. The session explored the role of the CDISC and DDI standards for addressing these issues.
Presented by Gareth Knight at the ADMIT Network conference, organised by the Association for Data Management in the Tropics, in Antwerp, Belgium on December 1st 2015.
A presentation of the Dutch Techcentre for Life Sciences FAIR Data ecosystem, given at the BlueBridge workshop, a pre-event of the Research Data Alliance's 9th Plenary.
This document discusses the development of the DATS (Data Tag Suite), which is needed for DataMed to index data sources in a scalable way, similar to how JATS indexes literature for PubMed. The DATS model was developed through a community-driven process involving use cases and existing metadata schemas. It includes core and extended elements to describe datasets and other digital research objects. The model is designed around the dataset entity and serialized in JSON and JSON-LD mapped to schema.org to increase visibility, accessibility, and searchability. Efforts are ongoing to further align DATS with schema.org and integrate it with related metadata standards and tools.
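The core/extended split described above can be sketched in a few lines: a record must carry a small set of core elements, while extended elements stay optional. The element names below are illustrative only, not the official DATS specification.

```python
# Sketch of a core/extended element check in the spirit of DATS.
# The core set here is hypothetical, not the official DATS schema.
CORE = {"title", "types", "identifier"}

def missing_core_elements(record: dict) -> set:
    """Return which (hypothetical) core elements a record lacks."""
    return CORE - record.keys()

record = {"title": "Example transcriptomics study", "types": ["RNA-seq"]}
print(missing_core_elements(record))   # the record lacks an identifier
```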
URM concept for sharing information within communities (Karel Charvat)
The document describes the Uniform Resource Management (URM) concept for sharing information within communities. URM provides a framework for standardized description of information using metadata schemes and controlled vocabularies to improve discovery. It is implemented through various portals and tools that allow users to manage and discover knowledge according to context. Initial implementations included portals for nature, sustainability and rural information in the Czech Republic and Latvia. URM supports collaborative knowledge sharing through interoperable systems based on open standards.
BioPharma and FAIR Data, a Collaborative Advantage (Tom Plasterer)
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers are embracing this approach. For BioPharma being able to effectively share and reuse data is a tremendous competitive advantage, within a company, with peer organizations, key opinion leaders and regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear... (dkNET)
dkNET provides a single portal for discovering over 3,500 biomedical research resources and datasets. It aims to make these resources findable, accessible, interoperable, and reusable in accordance with the FAIR principles. The portal contains three main sections for browsing community resources, additional resources, and literature. It utilizes faceted searching and provides analytics and notifications to help users track changes to resources over time.
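The faceted searching mentioned above boils down to counting how many resources fall under each value of a facet, so the interface can show filter counts. A minimal sketch, with hypothetical records and facet names:

```python
# Sketch of facet counting for a faceted-search UI (hypothetical records).
from collections import Counter

RESOURCES = [
    {"name": "antibody registry",      "type": "database", "domain": "diabetes"},
    {"name": "mouse phenotype portal", "type": "database", "domain": "kidney"},
    {"name": "NGS analysis toolkit",   "type": "software", "domain": "kidney"},
]

def facet_counts(records, facet):
    """Count records per value of the given facet field."""
    return Counter(r[facet] for r in records)

print(facet_counts(RESOURCES, "type"))   # Counter({'database': 2, 'software': 1})
```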
This document provides an overview of the FAIR data principles and the FAIR data ecosystem. It explains that FAIR data aims to support communities in publishing and utilizing scientific data and knowledge in a findable, accessible, interoperable, and reusable manner. It then describes the different levels of the FAIR data ecosystem: normative principles, standards in the FAIR data protocol, FAIR data resources that comply with these standards, and systems and tools that use FAIR data. It provides examples of converting raw data into FAIR data resources and potential applications of a FAIR data ecosystem.
This document discusses metadata, which is defined in multiple ways such as "data about data" or "information that describes content". It describes the main types of metadata including descriptive, administrative, preservation, technical, and use metadata. Descriptive metadata facilitates discovery, identification, and selection of resources, while administrative metadata manages access and use. Preservation and technical metadata document how digital resources are created and maintained over time.
The Nuclear Receptor Signaling Atlas (NURSA) is partnering with dkNET (the NIDDK Information Network) to host a dataset challenge, and we invite you to join! Everyone is talking about Big Data. How can we ensure that the impact of individual scientists, working on a myriad of small and focused studies that discover and probe new phenomena, is not lost in the Big Data world? In fact, there is more than one way to generate big data, and we would like your help in creating and expanding "big data" for NIDDK! In this 30-minute webinar, the dkNET team will give an overview of the challenge task, show how to use dkNET to find research resources, and share top tips!
Metadata provides consistency, clarity, and data lineage. It defines structured information about documents and content like author, title, and keywords. There are three main types - descriptive, structural, and administrative. Metadata is used to identify, manage, retrieve, and track usage of content. It provides consistency of definitions between different terminology, clarity of relationships between entities, and clarity of where data originated from and how it has changed over time.
This document provides an overview of metadata standards, including their purpose and types. It describes the MARC 21 and Dublin Core metadata standards in detail. MARC 21 is the predominant bibliographic standard, with formats for bibliographic data, holdings, and authority data. It exists in both MARC 21 and MARCXML syntaxes. Dublin Core is a simpler standard for resource discovery with 15 basic elements. It includes both simple and qualified versions with controlled vocabularies. The document lists several metadata standards and development organizations.
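The 15 elements of simple (unqualified) Dublin Core mentioned above are small enough to list in full. A minimal sketch that checks a hypothetical record against them:

```python
# The 15 elements of simple Dublin Core, with a hypothetical record checked
# against them. Unqualified DC allows any element to repeat or be absent.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

def is_simple_dc(record: dict) -> bool:
    """True if every field in the record is a simple Dublin Core element."""
    return set(record) <= DC_ELEMENTS

record = {"title": "Metadata standards overview",
          "creator": "Example Author",
          "date": "2015"}
print(is_simple_dc(record))
```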
Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely promoted as a basis for a shared research data infrastructure. Nevertheless, researchers involved in next-generation sequencing (NGS) still lack adequate RDM solutions. NGS metadata is generally not stored together with the raw NGS data, but kept by individual researchers in separate files. This situation complicates RDM practice. Moreover, the (meta)data often does not meet the FAIR principles [6]. Consequently, a central FAIR-compliant repository is highly desirable to support NGS-related research. We have selected iRODS (the integrated Rule-Oriented Data System) [3] as a basis for implementing a sequencing data repository because it allows storing data and metadata together. iRODS serves as scalable middleware to access different storage facilities in a centralized and virtualized way, and supports different types of clients. This repository will be part of an ecosystem of RDM solutions that cover complementary phases of the research data life cycle in our organization (the Academic Medical Center of the University of Amsterdam). We selected Virtuoso [5] to enrich the metadata from iRODS and to manage a triplestore for linked data. The metadata in the iCAT (iRODS' metadata catalogue) and the ontology in Virtuoso are kept synchronized by enforcing strict data manipulation policies. We have implemented a prototype to preserve raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes: Davrods [4] for data and metadata ingestion and data retrieval; Metalnx-web [7] for administration, data curation, and repository browsing; and iCommands [2] for all tasks by advanced users. Different user profiles are defined (principal investigator, data curator, repository administrator), each with different access rights. New data is ingested by copying raw sequence files and the corresponding metadata file (a sample sheet) to the landing collection on iRODS. An iRODS rule is triggered by the sample sheet file; it extracts the metadata and registers it in the iCAT as AVU (Attribute, Value, Unit) triples. Ontology files are registered in Virtuoso. The sequence files are copied to the persistent collection and are made uniquely identifiable based on metadata. All steps are recorded in a report file that enables monitoring and tracking of progress and faults. Here we describe the design and implementation of the prototype and discuss the first assessment results. Initial results indicate that the proposed solution is acceptable and fits the researchers' workflow well.
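The metadata-extraction step of the ingestion workflow above can be sketched as parsing a sample sheet into (Attribute, Value, Unit) triples ready for registration in the iCAT. The sample-sheet layout and unit mapping here are hypothetical, and real ingestion runs inside an iRODS rule, not plain Python.

```python
# Sketch: turn a (hypothetical) CSV sample sheet into AVU triples for the iCAT.
import csv
import io

SAMPLE_SHEET = """sample_id,organism,read_length
S001,Homo sapiens,150
S002,Homo sapiens,100
"""

UNITS = {"read_length": "bp"}   # hypothetical attribute-to-unit mapping

def sheet_to_avus(text: str):
    """Yield (attribute, value, unit) triples for every non-key cell."""
    for row in csv.DictReader(io.StringIO(text)):
        sample = row["sample_id"]
        for attr, value in row.items():
            if attr != "sample_id":
                yield (f"{sample}.{attr}", value, UNITS.get(attr, ""))

avus = list(sheet_to_avus(SAMPLE_SHEET))
print(avus[0])   # ('S001.organism', 'Homo sapiens', '')
```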
FAIR Data Management and FAIR Data Sharing (Merce Crosas)
Presentation at the Critical Perspectives on the Practice of Digital Archaeology symposium: http://archaeology.harvard.edu/critical-perspectives-practice-digital-archaeology
Dataverse, Cloud Dataverse, and DataTags (Merce Crosas)
Talk given at Two Sigma:
The Dataverse project, developed at Harvard's Institute for Quantitative Social Science since 2006, is a widely used software platform to share and archive data for research. There are currently more than 20 Dataverse repository installations worldwide, with the Harvard Dataverse repository alone hosting more than 60,000 datasets. Dataverse provides incentives to researchers to share their data, giving them credit through data citation and control over terms of use and access. In this talk, I'll discuss the Dataverse project, as well as related projects such as DataTags to share sensitive data and Cloud Dataverse to share Big Data.
Neuroscience research increasingly relies on large, heterogeneous datasets from various sources. Integrating these diverse data types and making them accessible presents challenges. The NIF (Neuroscience Information Framework) addresses this by creating a federated search engine and unified interface to access multiple neuroscience databases. NIF aims to make neuroscience data more discoverable, accessible, and usable through techniques like unique identifiers, metadata standards, and semantic integration. This will help researchers more effectively find and use relevant neuroscience information.
Applied semantic technology and linked data (William Smith)
Mapping a human brain generates petabytes of gene listings and the corresponding locations of these genes throughout the human brain. Due to the size of this dataset, a prototype Semantic Web application was created with the unique ability to link new datasets from similar fields of research and present these new models to an online community. The resulting application presents a large set of gene-to-location mappings and provides new information about diseases, drugs, and side effects in relation to the genes and areas of the human brain.
In this presentation we will discuss the normalization processes and tools for adding new datasets, the user experience throughout the publishing process, the underlying technologies behind the application, and demonstrate the preliminary use cases of the project.
dkNET is a portal that provides a single point of entry for discovering NIDDK-relevant research resources and data to help researchers make efficient decisions. It allows searching across community databases, literature, and over 200 biomedical databases. Key features include personalized search capabilities, the ability to save searches and set up alerts, and creating collections of search results. The portal aims to interconnect research communities by providing access to large pools of interconnected data and resources.
DataTags, The Tags Toolset, and Dataverse Integration (Michael Bar-Sinai)
This presentation describes the concept of DataTags, which simplifies handling of sensitive datasets. It then shows the Tags toolset, and how it is integrated with Dataverse, Harvard's popular dataset repository.
Doing for Data what PubMed did for literature: DATS, a model for dataset description, dataset indexing, and data discovery.
Google Slides [https://goo.gl/cd5KKa] or SlideShare [https://goo.gl/c8DH5N]
Are you a researcher, citizen scientist, institution or community looking for data storage and value-added services? Do you want access to tools to make your research data more FAIR (findable, accessible, interoperable, and reusable)? Interested in seeing how the future European Open Science Cloud could support research data and practically foster cross-border, cross-disciplinary collaboration? Then this webinar is for you!
This document provides an overview of metadata, including:
1) Definitions of metadata from various sources, describing it as data that describes other data or information resources.
2) The main types of metadata - descriptive, processing, administrative, and semantic. Descriptive metadata supports retrieval of information, processing metadata supports processing of information, and administrative metadata supports management of information.
3) How metadata can be created automatically by tools or manually by people. Metadata schemes provide a formal structure to identify a discipline's knowledge and link it to information resources.
The document provides an introduction to PREMIS (Preservation Metadata: Implementation Strategies) and its application in audiovisual archives. It discusses the challenges of digital preservation and the need for preservation metadata to ensure long-term access. It then summarizes the key aspects of PREMIS, including the PREMIS Data Dictionary, its relationship to the OAIS reference model, the five interacting entities in the PREMIS data model, and issues around implementing PREMIS in archives.
FAIR data has flown up the hype curve without a clear sense of the return on the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world's data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen.
This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.
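The science-knowledge-graph idea above can be sketched minimally: FAIR-style records become subject-predicate-object triples, and a "novel question" is simply a graph query across them. The entities and predicates below are hypothetical.

```python
# Minimal knowledge-graph sketch: records as triples, questions as graph queries.
# Entities and predicates are hypothetical examples.
TRIPLES = [
    ("dataset:42", "about",      "gene:TP53"),
    ("dataset:42", "producedBy", "lab:A"),
    ("paper:7",    "cites",      "dataset:42"),
    ("gene:TP53",  "linkedTo",   "disease:cancer"),
]

def objects(subject: str, predicate: str):
    """All objects reachable from `subject` via `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

# Which datasets does paper:7 cite, and what are they about?
cited = objects("paper:7", "cites")
print(cited)                              # ['dataset:42']
print(objects(cited[0], "about"))         # ['gene:TP53']
```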
Urm concept for sharing information inside of communitiesKarel Charvat
The document describes the Uniform Resource Management (URM) concept for sharing information within communities. URM provides a framework for standardized description of information using metadata schemes and controlled vocabularies to improve discovery. It is implemented through various portals and tools that allow users to manage and discover knowledge according to context. Initial implementations included portals for nature, sustainability and rural information in the Czech Republic and Latvia. URM supports collaborative knowledge sharing through interoperable systems based on open standards.
BioPharma and FAIR Data, a Collaborative AdvantageTom Plasterer
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers are embracing this approach. For BioPharma being able to effectively share and reuse data is a tremendous competitive advantage, within a company, with peer organizations, key opinion leaders and regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...dkNET
dkNET provides a single portal for discovering over 3,500 biomedical research resources and datasets. It aims to make these resources findable, accessible, interoperable, and reusable in accordance with the FAIR principles. The portal contains three main sections for browsing community resources, additional resources, and literature. It utilizes faceted searching and provides analytics and notifications to help users track changes to resources over time.
This document provides an overview of FAIR data principles and the FAIR data ecosystem. It discusses what FAIR data is, including that FAIR data aims to support communities in publishing and utilizing scientific data and knowledge in a findable, accessible, interoperable, and reusable manner. It then describes the different levels of the FAIR data ecosystem, including normative principles, standards in the FAIR data protocol, FAIR data resources that comply with these standards, and systems/tools that use FAIR data. It provides examples of converting raw data into FAIR data resources and the potential applications of a FAIR data ecosystem.
This document discusses metadata, which is defined in multiple ways such as "data about data" or "information that describes content". It describes the main types of metadata including descriptive, administrative, preservation, technical, and use metadata. Descriptive metadata facilitates discovery, identification, and selection of resources, while administrative metadata manages access and use. Preservation and technical metadata document how digital resources are created and maintained over time.
The Nuclear Receptor Signaling Atlas (NURSA) is partnering with dkNET (NIDDK Information Network) to host a dataset challenge, and we invite you to join! Everyone is talking about Big Data. How can we ensure that the impact of individual scientists working on a myriad of small and focused studies that discover and probe new phenomena - is not lost in the Big Data world. In fact, there is more than one way to generate big data and we would like your help in creating and expanding “big data” for NIDDK! In this 30-minute webinar, dkNET team will give a presentation about the overview of challenge task, how to use dkNET to find research resources, and top tips!
Metadata provides consistency, clarity, and data lineage. It defines structured information about documents and content like author, title, and keywords. There are three main types - descriptive, structural, and administrative. Metadata is used to identify, manage, retrieve, and track usage of content. It provides consistency of definitions between different terminology, clarity of relationships between entities, and clarity of where data originated from and how it has changed over time.
This document provides an overview of metadata standards, including their purpose and types. It describes the MARC 21 and Dublin Core metadata standards in detail. MARC 21 is the predominant bibliographic standard, with formats for bibliographic data, holdings, and authority data. It exists in both MARC 21 and MARCXML syntaxes. Dublin Core is a simpler standard for resource discovery with 15 basic elements. It includes both simple and qualified versions with controlled vocabularies. The document lists several metadata standards and development organizations.
Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely
promoted as basis for a shared research data infrastructure. Nevertheless, researchers involved in next generation
sequencing (NGS) still lack adequate RDM solutions. The NGS metadata is generally not stored together with the raw
NGS data, but kept by individual researchers in separate files. This situation complicates RDM practice. Moreover,
the (meta)data does often not meet the FAIR principles [6]. Consequently, a central FAIR-compliant repository
is highly desirable to support NGS related research. We have selected iRODS (Rule-Oriented Data management
systems) [3] as a basis for implementing a sequencing data repository because it allows storing both data and metadata
together. iRODS serves as scalable middleware to access different storage facilities in a centralized and virtualized
way, and supports different types of clients. This repository will be part of an ecosystem of RDM solutions that
cover complementary phases of the research data life cycle in our organization (Academic Medical Center of the
University of Amsterdam). We selected Virtuoso [5] to enrich the metadata from iRODS to enable the management
of a triplestore for linked data. The metadata in the iCat (iRODS’ metadata catalogue) and the ontology in Virtuoso
are kept synchronized by enforcement of strict data manipulation policies. We have implemented a prototype to
preserve raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes:
Davrods [4] for data and metadata ingestion, data retrieval; Metalnx-web [7] for administration, data curation, and
repository browsing; and iCommands [2] for all tasks by advanced users. Different user profiles are defined (principal
investigator, data curator, repository administrator), with different access rights. New data is ingested by copying raw
sequence files and the corresponding metadata file (a sample sheet) to the landing collection on iRODS. An iRODS
rule is triggered by the sample sheet file, which extracts the metadata and registers it to the iCAT as AVU (Attribute,
Value and Unit). Ontology files are registered into Virtuoso. The sequence files are copied to the persistent collection
and are made uniquely identifiable based on metadata. All the steps are recorded into a report file that enables
monitoring and tracking of progress and faults. Here we describe the design and implementation of the prototype,
and discuss the first assessment results. Initial results indicate that the proposed solution is acceptable and fits the
researchers workflow well.
FAIR Data Management and FAIR Data SharingMerce Crosas
Presentation at the Critical Perspective on the Practice of Digiral Archeology symposium: http://archaeology.harvard.edu/critical-perspectives-practice-digital-archaeology
Dataverse, Cloud Dataverse, and DataTagsMerce Crosas
Talk given at Two Sigma:
The Dataverse project, developed at Harvard's Institute for Quantitative Social Science since 2006, is a widely used software platform to share and archive data for research. There are currently more than 20 Dataverse repository installations worldwide, with the Harvard Dataverse repository alone hosting more than 60,000 datasets. Dataverse provides incentives to researchers to share their data, giving them credit through data citation and control over terms of use and access. In this talk, I'll discuss the Dataverse project, as well as related projects such as DataTags to share sensitive data and Cloud Dataverse to share Big Data.
Neuroscience research increasingly relies on large, heterogeneous datasets from various sources. Integrating these diverse data types and making them accessible presents challenges. The NIF (Neuroscience Information Framework) addresses this by creating a federated search engine and unified interface to access multiple neuroscience databases. NIF aims to make neuroscience data more discoverable, accessible, and usable through techniques like unique identifiers, metadata standards, and semantic integration. This will help researchers more effectively find and use relevant neuroscience information.
Applied semantic technology and linked dataWilliam Smith
Mapping a human brain generates petabytes of gene listings and the corresponding locations of these genes throughout the human brain. Due to the large dataset a prototype Semantic Web application was created with the unique ability to link new datasets from similar fields of research, and present these new models to an online community. The resulting application presents a large set of gene to location mappings and provides new information about diseases, drugs, and side effects in relation to the genes and areas of the human brain.
In this presentation we will discuss the normalization processes and tools for adding new datasets, the user experience throughout the publishing process, the underlying technologies behind the application, and demonstrate the preliminary use cases of the project.
dkNET is a portal that provides a single point of entry for discovering NIDDK-relevant research resources and data to help researchers make efficient decisions. It allows searching across community databases, literature, and over 200 biomedical databases. Key features include personalized search capabilities, the ability to save searches and set up alerts, and creating collections of search results. The portal aims to interconnect research communities by providing access to large pools of interconnected data and resources.
DataTags, The Tags Toolset, and Dataverse IntegrationMichael Bar-Sinai
This presentation describes the concept of DataTags, which simplifies handling of sensitive datasets. It then shows the Tags toolset, and how it is integrated with Dataverse, Harvard's popular dataset repository.
Doing for data what PubMed did for literature: DATS, a model for dataset description, dataset indexing and data discovery.
Googleslides [https://goo.gl/cd5KKa] or Slideshare [https://goo.gl/c8DH5N]
Are you a researcher, citizen scientist, institution or community looking for data storage and value-added services? Do you want access to tools to make your research data more FAIR (findable, accessible, interoperable, and reusable)? Interested in seeing how the future European Open Science Cloud could support research data and practically foster cross-border, cross-disciplinary collaboration? Then this webinar is for you!
This document provides an overview of metadata, including:
1) Definitions of metadata from various sources, describing it as data that describes other data or information resources.
2) The main types of metadata: descriptive, processing, administrative, and semantic. Descriptive metadata supports retrieval of information, processing metadata supports processing of information, and administrative metadata supports management of information.
3) How metadata can be created automatically by tools or manually by people. Metadata schemes provide a formal structure to identify a discipline's knowledge and link it to information resources.
The document provides an introduction to PREMIS (Preservation Metadata: Implementation Strategies) and its application in audiovisual archives. It discusses the challenges of digital preservation and the need for preservation metadata to ensure long-term access. It then summarizes the key aspects of PREMIS, including the PREMIS Data Dictionary, its relationship to the OAIS reference model, the five interacting entities in the PREMIS data model, and issues around implementing PREMIS in archives.
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world’s data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen.
This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.
Essentials 4 Data Support: a fine course in FAIR Data SupportEllen Verbakel
The document summarizes the Essentials 4 Data Support (E4DS) course, which teaches people how to support researchers in storing, managing, archiving, and sharing research data according to FAIR principles. The course covers topics like data documentation, identifiers, formats, metadata, and licensing. It is offered online or in a blended format over 6 weeks. The goal is to educate data supporters so that researchers can find, access, interoperate with, and reuse each other's data in a FAIR manner.
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...OpenAIRE
OpenAIRE Interoperability Workshop (8 Feb. 2013).
DataCite – Bridging the gap and helping to find, access and reuse data – Herbert Gruttemeier, INIST-CNRS
Tools and Techniques for Creating, Maintaining, and Distributing Shareable Me...Jenn Riley
This document discusses tools and techniques for creating, maintaining, and distributing shareable metadata. It emphasizes that metadata should be structured to be understandable outside of local contexts and useful for other institutions. Key aspects of shareable metadata include using appropriate content and vocabularies, ensuring records are coherent, providing useful context, and conforming to standards. The document also outlines example workflows and considerations for making metadata shareable.
FAIR Data Knowledge Graphs–from Theory to PracticeTom Plasterer
This document discusses building FAIR data knowledge graphs from theory to practice. It begins by outlining what R&D researchers want to do with data, such as understanding disease mechanisms and using patient data, but that currently data is fragmented across systems. It then introduces the FAIR data principles and describes building a knowledge graph that incorporates data from multiple sources using standards like the Data Catalog vocabulary. The key challenges discussed are determining canonical representations for entities and linking data to public vocabularies through an enrichment process.
The webinar discussed FAIRDOM services that can help applicants to the ERACoBioTech call with their data management plans and requirements. FAIRDOM offers webinars on developing data management plans, and their platform and tools can help with organizing, storing, sharing, and publishing research data and models in a FAIR manner by utilizing metadata standards. Different levels of support are available, from general community resources through their hub, to premium customized support for individual projects. Consortia can include FAIRDOM as a subcontractor within the guidelines of the ERACoBioTech call.
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...ASIS&T
Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23, 2015
Part of “Beyond metadata: Supporting non-standardized documentation to facilitate data reuse”
This document discusses metadata, which is data that describes other data. It explains that metadata captures information about data content, time, location, how it was collected and why. Good metadata allows data to be discovered, accessed, and understood. Key elements in a metadata record are identified and standards like FGDC and ISO are examined. The value of metadata for data users, providers and organizations is outlined. Tips for writing clear, complete and computer-readable metadata are provided.
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM
This document provides information about a webinar from the FAIRDOM Consortium on data management for ERACoBioTech full proposals. It includes:
- Details on how to budget for and include a data management plan in proposals
- A checklist for developing a data management plan covering topics like the types and volumes of data, data sharing and reuse, and making data FAIR
- An overview of the FAIRDOM services and software platform that can help with project data management and stewardship
The document discusses recommendations from a workshop on peer review of research data. It focuses on three key areas:
1. Connecting data review with data management planning by requiring data sharing plans, ensuring adequate funding for data management, and refusing publication without clear data access.
2. Connecting scientific and technical review with data curation by linking articles and data with versioning, avoiding duplicate review efforts, and addressing issues found in data.
3. Connecting data review with article review by requiring methods/software information, providing review checklists, ensuring data access for reviewers, and permanent dataset identifiers from repositories.
This document discusses the agINFRA project's efforts to enhance interoperability between agricultural data sources by developing a linked data framework for germplasm data. The agINFRA Germplasm Working Group aims to identify relevant standards, analyze existing schemas and vocabularies, and propose recommendations for exposing germplasm resources as linked open data. Key outcomes include a dossier of germplasm information and engagement with stakeholders. The proposed methodology involves defining a base schema, publishing local classifications as linked data, and linking data from different sources using common vocabularies. Implementation plans include publishing germplasm vocabularies and phenotypic data in 2014.
NIH iDASH meeting on data sharing - BioSharing, ISA and Scientific DataSusanna-Assunta Sansone
1) The document discusses Susanna-Assunta Sansone's roles and work related to promoting FAIR data standards and practices.
2) It highlights some of her leadership positions with organizations like BioSharing that work to map and promote standards.
3) The document also discusses Scientific Data, a peer-reviewed journal launched by Nature Publishing Group to publish detailed descriptions of scientifically valuable datasets to facilitate reuse.
Steven McEachern - ADA, DDI (metadata standard) and the Data LifecycleSteve Androulakis
Dr. McEachern is Director of the Australian Data Archive at the Australian National University, and has research interests in data management and archiving, community and social attitude surveys, new data collection methods, and reproducible research methods.
This talk was given for the Monthly Tech Talks event hosted by Australian data infrastructure groups ANDS, NeCTAR, RDS and others.
(1) The document discusses challenges of managing large and complex datasets for interdisciplinary research projects. It presents Hadoop and the Etosha data catalog as solutions.
(2) Etosha aims to publish and link metadata about datasets to enable discovery and sharing across distributed research clusters. It focuses on descriptive, structural and administrative metadata rather than just technical metadata.
(3) Etosha's architecture includes a distributed metadata service and context browser that can query metadata from different Hadoop clusters to support federated querying and subquery delegation.
How Portable Are the Metadata Standards for Scientific Data?Jian Qin
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with ever-growing data. This paper reports the findings of a survey of metadata standards in the scientific data domain and argues for the need for a metadata infrastructure. The survey collected 4400+ unique elements from 16 standards and categorized them into 9 categories. The highest counts of elements occurred in the descriptive category, and many of them overlapped with Dublin Core elements. The same pattern appeared among elements that co-occurred in different standards: a small number of semantically general elements appeared across the largest number of standards, while the rest of the element co-occurrences formed a long tail with a wide range of specific semantics. The paper discusses the implications of these findings for metadata portability and infrastructure, and points out that large, complex standards and widely varied naming practices are the major hurdles to building a metadata infrastructure.
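The survey method described above — tallying how many standards each element appears in, and separating a small head of general elements from a long tail of specific ones — can be sketched with a simple counter. The standards and element lists below are invented for illustration; the actual paper surveyed 16 standards and 4400+ unique elements.

```python
from collections import Counter

# Illustrative (invented) element lists for a few metadata standards.
standards = {
    "DC":       {"title", "creator", "subject", "date", "identifier"},
    "DCAT":     {"title", "description", "keyword", "identifier", "distribution"},
    "ISO19115": {"title", "abstract", "extent", "lineage", "identifier"},
}

# Count in how many standards each element appears.
co_occurrence = Counter(
    element for elements in standards.values() for element in elements
)

# A small head of semantically general elements spans all standards ...
head = [e for e, n in co_occurrence.items() if n == len(standards)]
# ... while the rest form a long tail of standard-specific semantics.
tail = [e for e, n in co_occurrence.items() if n == 1]

print(sorted(head))  # → ['identifier', 'title']
print(len(tail))     # → 9
```

Even on this toy input the paper's pattern shows up: the shared head is tiny and generic, the tail dominates.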
Similar to eROSA Stakeholder WS1: Data discovery through federated dataset catalogues (20)
"Building Capacities for Open Science" - The example of AGINFRA+ and e-ROSA. Presented during the AGRIRESEARCH Conference, organised by DG AGRI in Brussels.
Community and Governance Recommendations for the Future State of an e-infrast...e-ROSA
This document provides recommendations for developing an e-infrastructure to support open science in agri-food systems. It identifies key societal challenges around feeding the growing population, climate change, unhealthy diets, and environmental pressures. Three major trends are digital agriculture, new genetic techniques, and adopting a systems perspective. Recommendations focus on sharing data and models, connecting diverse data sources through standards, and facilitating collaboration across disciplines and sectors. Specific recommendations include establishing sustainable funding, aligning with the European Open Science Cloud, promoting open innovation, and developing large public-private partnerships for data-driven research. The overarching goal is to support evidence-based policymaking and address challenges through open, international cooperation.
Technical Recommendations for the Future State of an e-infrastructure in Agri...e-ROSA
This document outlines recommendations for the future technical state of an e-infrastructure for agri-food sciences. It describes the past state as isolated research silos, the present as basic shared services but disjoint complex services, and envisions the future as:
- Extending shared horizontal services to include mature technologies from different communities
- Optimizing shared infrastructure usage for each community/task
- Easily customizing horizontal services for specific community needs
- Seamlessly incorporating new services
It recommends:
- Developing large-scale common data/service semantics and standards
- Incorporating infrastructure under a federation layer for optimized usage/sharing
- Making cross-community services available via semantic descriptions to automate their integration
Odile Hologne's presentation at the eROSA Workshop “Towards Open Science in Agriculture & Food”, a side event to High Level conference on FOOD 2030, Plovdiv, Bulgaria (13/6/2018)
FACCE JPI agenda on big data and digitization of agriculturee-ROSA
Paul Wiley's presentation at the eROSA Workshop “Towards Open Science in Agriculture & Food”, a side event to High Level conference on FOOD 2030, Plovdiv, Bulgaria (13/6/2018)
ICT-AGRI agenda on digitization of agriculturee-ROSA
This document discusses trends in precision farming and an overview of research and innovation activities related to digitizing agriculture. It outlines key trends such as the increasing use of sensors, drones, robotics, and network connectivity in agriculture. It also discusses trends in software including big data, open data standards, apps for farm management, and integrating data along the farm to fork supply chain. The document concludes by noting the growth of startups in this area and opportunities for the ICT-AGRI initiative to contribute to an open agrifood science cloud.
D4Science experience: VREs for increasing the sharing and collaboration in th...e-ROSA
Donatella Castelli's presentation at the eROSA Workshop “Towards Open Science in Agriculture & Food”, a side event to High Level conference on FOOD 2030, Plovdiv, Bulgaria (13/6/2018)
The state-of-play of the general EOSC policy worke-ROSA
Corina Pascu's presentation at the eROSA Workshop “Towards Open Science in Agriculture & Food”, a side event to High Level conference on FOOD 2030, Plovdiv, Bulgaria (13/6/2018)
The Vision and the Grand Challenges of the Agri-Food Communitye-ROSA
The document discusses the vision and grand challenges of the agri-food community. It identifies three main trends: adopting a systems perspective, new genetic techniques, and digital agriculture. It outlines the food system challenges of feeding 9 billion people while addressing climate change, unhealthy diets, and planetary boundaries. The food system is divided into three components: smart farming and food security, gene-based approaches, and food safety, nutrition and health. Each component lists societal and scientific expectations as well as obstacles to open science approaches. The overall challenges are interconnectedness and developing inclusive, sustainable solutions through increased sharing, connecting and collaborating across the agri-food community.
Why the food sector needs a research infrastructure on Food and Health Consum...e-ROSA
Bent Egberg Mikkelsen and Karin Zimmermann's presentation at the eROSA Workshop “Towards Open Science in Agriculture & Food”, a side event to High Level conference on FOOD 2030, Plovdiv, Bulgaria (13/6/2018)
The document summarizes a vision for food systems in 2030 presented at an eROSA stakeholder workshop. The vision is for food systems that produce healthy, nutritious foods through efficient and environmentally sustainable methods. These food systems would operate as collaborative networks constantly improving their economic, environmental, and social performance for all actors. The food systems would contribute to achieving sustainability development goals and mitigate/adapt to climate change impacts.
Technical Implementation Agenda for a pan-European Scientific e-infrastructur...e-ROSA
This document outlines a vision for a pan-European e-infrastructure for agri-food research. It describes the current fragmented state of individual research organizations and isolated data silos. The vision is to build common semantic specifications and standards to incorporate physical infrastructure and make cross-community services available via semantically enriched descriptions. This would automate the integration of existing and new services to optimize resource sharing and data integration across communities. The priorities are establishing standards and semantics, designing common horizontal services, and specifying community-specific services to work towards the goal of mission-driven research enabled by a unified e-infrastructure.
E-Infrastructure for open agri-food sciences - The landscapee-ROSA
eROSA has received funding from the European Union to map out the technical ecosystem for open agriculture and food science data. The mapping is based on analyzing various eROSA, RDA, and other project activities and identifies organizations, initiatives, data sources, and research infrastructures. The landscape analysis found that while there is massive data production, data is often siloed and difficult to find or access due to immature practices around data management, sharing, and analysis. Challenges include technical issues like long-term preservation and semantics standardization as well as cultural challenges engaging communities and developing sustainable governance models.
This document summarizes an OpenAIRE stakeholder workshop that took place in Athens on May 21-22, 2018. OpenAIRE supports open science by monitoring research outputs, accelerating interoperability and exchange, and supporting researchers and infrastructure providers through services like an open science helpdesk and research data management support. The workshop discussed OpenAIRE's network of National Open Access Desks, services to support open policies, infrastructure, open research data and open access publications, and efforts to build an open scholarly communication graph and research information system. OpenAIRE also presented services for content providers like the PROVIDE Dashboard for validation, enrichment and usage statistics of metadata.
The document describes the D4Science infrastructure, which provides services and environments to support cross-disciplinary research activities. It offers data discovery, access, processing and publishing services across multiple domains like marine science, social mining, and the humanities. The infrastructure leverages existing resources through federation and APIs, and provides virtual research environments and workspaces in a flexible, scalable manner to support over 5,100 users in 44 countries.
EOSC-Hub - Services for the European Open Science Cloude-ROSA
The document summarizes the objectives and services of EOSC-hub, which is implementing and operating access channels for the European Open Science Cloud (EOSC). EOSC-hub aims to (1) aggregate services from local/national providers and demands from researchers through the EOSC, (2) define engagement rules with EOSCpilot and develop a service framework, and (3) operate and integrate an initial set of baseline, thematic, and federation services. The services support the full research data lifecycle from discovery to reuse. EOSC-hub involves 74 partners from 23 countries and receives €30 million in Horizon 2020 funding over 3 years to develop and advance EOSC.
Grand Challenges and Open Science for the Food Systeme-ROSA
The document discusses open science approaches for addressing challenges in the global food system. It identifies three key components of the food system - smart farming, food security and the environment; gene-based approaches from omics to landscape; and food safety, nutrition and health. For each component, it outlines societal and scientific challenges, as well as obstacles and expectations for developing open science solutions. An example case study on global agricultural monitoring is also provided. The document argues that developing open science for food systems requires efforts to share data and resources, connect through standards and best practices, and enable broader collaboration across disciplines and sectors.
This document summarizes a presentation about the eROSA project, which received Horizon 2020 funding. It discusses eROSA's vision for an open e-science infrastructure for agriculture. Some key points include:
- eROSA aims to provide shared semantics, data discovery services, and sustainable storage through resources like data portals and virtual research environments.
- It compares how organic agriculture aligns with the UN's Sustainable Development Goals around issues like increasing productivity and resilience while reducing environmental impacts.
- The document outlines eROSA's status in implementing facets of openness, interoperability, and reuse within the agricultural domain. It closes with eROSA's vision for collaborative, region-specific food systems by 2030.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are top priorities for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences. (3) They are context-aware, encoding a different set of transformations for different use cases. (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
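The idea of auto-generating compliance-enforcing views from declarative annotations can be sketched as follows. This is a hypothetical toy, not LinkedIn's ViewShift implementation: the annotation names, policies, and SQL functions are assumptions for illustration.

```python
# Hypothetical sketch: render a compliance-enforcing SQL view from
# declarative per-column policy annotations.
ANNOTATIONS = {
    "member_id": "hash",    # pseudonymize identifiers
    "email":     "redact",  # drop direct contact details
    "country":   "pass",    # non-sensitive, expose as-is
}

TRANSFORMS = {
    "hash":   lambda col: f"sha2(cast({col} as string), 256) AS {col}",
    "redact": lambda col: f"NULL AS {col}",
    "pass":   lambda col: col,
}

def compliance_view(table: str, annotations: dict) -> str:
    """Render a CREATE VIEW statement applying each column's policy."""
    select_list = ",\n  ".join(
        TRANSFORMS[policy](col) for col, policy in annotations.items()
    )
    return (f"CREATE VIEW {table}_compliant AS\n"
            f"SELECT\n  {select_list}\nFROM {table};")

print(compliance_view("members", ANNOTATIONS))
```

A catalog layer could then resolve `members` to `members_compliant` transparently; context-awareness would amount to keeping one annotation set per use case.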
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with "Financial Odyssey," our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
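The schema-metaprogramming idea above can be sketched minimally: derive downstream schemas programmatically from the upstream schema, so one upstream change propagates without hand-edited boilerplate in every job. The schema representation and field names here are invented for illustration, not the talk's actual tooling.

```python
# Hedged sketch of schema metaprogramming: downstream schemas are
# computed from the upstream one instead of being copied by hand.
upstream = {
    "user_id": "string",
    "ts": "timestamp",
    "purchase_amount": "double",
}

def derive(schema: dict, drop=(), add=None) -> dict:
    """Build a downstream schema from an upstream one by transformation."""
    out = {k: v for k, v in schema.items() if k not in drop}
    out.update(add or {})
    return out

# Downstream daily-aggregation job: drops the raw timestamp, adds a day bucket.
daily_schema = derive(upstream, drop=("ts",), add={"day": "date"})
print(sorted(daily_schema))  # → ['day', 'purchase_amount', 'user_id']
```

Adding a field upstream now flows into every derived schema automatically, while the explicit `drop`/`add` arguments keep the static-typing-style protection the talk contrasts with schema-on-read.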
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
eROSA Stakeholder WS1: Data discovery through federated dataset catalogues
1. Data discovery through federated dataset catalogs
Valeria Pesce
Secretariat of the Global Forum on Agricultural Research (GFAR)
Secretariat of the Global Open Data for Agriculture and Nutrition (GODAN) initiative
eROSA workshop, Montpellier, 6-7 July 2017
2. 1. Dataset discovery: how
• Data are in datasets, stored in some dataset repository
• Datasets can be made searchable through a dataset catalog
• Many institutional catalogs / geographically-scoped catalogs / thematic catalogs
• How many catalogs do I have to search?
>> General meta-catalogs? Different targeted catalogs?
>> Federated metadata catalogs / secondary catalogs
IDEAL
Good dataset metadata at the level of the local repository / catalog
Open, interoperable dataset metadata at the level of the primary repository / catalog
>>> Linked Data federated search engines; LOD-enabled primary catalog
(heavy requirements for the local primary catalog)
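The federation idea above can be sketched in a few lines: a secondary catalog harvests machine-readable metadata records from several primary catalogs and merges them into one searchable index. All catalog contents and field names below are hypothetical, invented for illustration.

```python
# Minimal sketch of a secondary (federated) catalog: merge metadata
# records harvested from several primary catalogs, then answer a single
# query across all of them. Records and field names are hypothetical.

primary_catalog_a = [
    {"title": "Wheat yield trials 2015", "theme": "crops"},
    {"title": "Soil moisture sensor readings", "theme": "soils"},
]
primary_catalog_b = [
    {"title": "Livestock census", "theme": "livestock"},
    {"title": "Crop calendar for maize", "theme": "crops"},
]

def federate(*catalogs):
    """Merge records from primary catalogs, tagging each with its source."""
    merged = []
    for name, catalog in catalogs:
        for record in catalog:
            merged.append({**record, "source": name})
    return merged

def search(index, theme):
    """One query across all federated records."""
    return [r["title"] for r in index if r["theme"] == theme]

index = federate(("catalog_a", primary_catalog_a),
                 ("catalog_b", primary_catalog_b))
print(search(index, "crops"))
# -> ['Wheat yield trials 2015', 'Crop calendar for maize']
```

The point of the sketch is the user-facing gain: one query instead of one per catalog, provided the primary catalogs expose their metadata in a harvestable form.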
3. 1. Dataset discovery: good metadata (1)
1. General metadata about the dataset “resource”:
a) identifier(s)
b) who is responsible for it
c) when and where the data were collected
d) relations to organizations, persons, publications, software, projects, funding…
e) the conditions for re-use (rights, licenses)
f) provenance, versions
g) the specific coverage of the dataset (type of data, thematic coverage, geographic
coverage)
Normally covered by generic vocabularies like Dublin Core or DCAT
IDEAL
Let’s look at existing good practices and standards
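As a concrete illustration of points (a)–(g), here is a minimal DCAT / Dublin Core-style dataset record serialized as JSON-LD with the Python standard library. The property names follow the DCAT and Dublin Core namespaces; the dataset values themselves are invented placeholders.

```python
import json

# A minimal DCAT/Dublin Core-style dataset description as JSON-LD.
# Property names follow DCAT and DCTERMS; all values are invented.
record = {
    "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                 "dct": "http://purl.org/dc/terms/"},
    "@type": "dcat:Dataset",
    "dct:identifier": "example-dataset-001",            # a) identifier
    "dct:publisher": "Example Research Institute",      # b) responsible party
    "dct:temporal": "2015-01/2016-12",                  # c) when collected
    "dct:license": "https://creativecommons.org/licenses/by/4.0/",  # e) re-use
    "dct:spatial": "Kenya",                             # g) geographic coverage
    "dcat:theme": "crop production",                    # g) thematic coverage
}

print(json.dumps(record, indent=2))
```

Because the record is plain JSON-LD, any harvester that understands the DCAT context can interpret it without catalog-specific code.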
4. 1. Dataset discovery: good metadata (2)
2. Metadata about the data structure!
a) The variables: the observed “dimensions” (e.g. time, geographic region, gender, elevation…) and the measured / observed phenomenon (e.g. life expectancy)
b) The specification of the dimensions (units of measure, time granularity, syntax, any scaling factors, and metadata such as the status of the observation, reference taxonomies…)
c) Possible time and space slices; subsets
Not always considered in generic dataset metadata vocabularies (DCAT), but traditionally included in research datasets (e.g. in formats like NetCDF) and covered by DataCube
IDEAL
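A DataCube-style structure description for dimensions and measures like those above could be sketched as follows. The shape loosely mirrors the RDF Data Cube (qb:) idea of a structure definition; the dataset, its dimensions and the observation values are all invented for illustration.

```python
# Sketch of a Data Cube-style structure definition: dimensions with their
# granularity/code lists, plus the measured phenomenon. Invented example.

structure = {
    "dimensions": [
        {"name": "refPeriod", "granularity": "year"},
        {"name": "refArea", "codeList": "ISO 3166 country codes"},
        {"name": "sex", "codeList": ["M", "F", "T"]},
    ],
    "measures": [
        {"name": "lifeExpectancy", "unit": "years"},
    ],
}

def validate_observation(obs, structure):
    """An observation is valid only if it carries every dimension and measure."""
    required = {d["name"] for d in structure["dimensions"]}
    required |= {m["name"] for m in structure["measures"]}
    return required <= set(obs)

obs = {"refPeriod": 2015, "refArea": "KE", "sex": "T", "lifeExpectancy": 66.0}
print(validate_observation(obs, structure))  # -> True
```

With such structure metadata exposed, a client can know what the columns of a dataset mean before downloading it, which is exactly what generic vocabularies like plain DCAT omit.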
5. 1. Dataset discovery: good metadata (3)
3. Metadata about the actual “serializations” or “distributions” of the dataset
1. Where to retrieve the dataset: URL (data dump, service…)
2. The necessary technical specifications to retrieve and parse a distribution of the dataset:
- format (file format, data format), vocabularies / data dictionaries
- protocol, API parameters…
Not always considered in generic dataset metadata vocabularies: DCAT covers data dumps and formats, VoID some services
IDEAL
Data will be processed by tools! Data formats and access protocols are important.
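Distribution metadata is what lets a tool pick a retrievable, parseable form of the dataset automatically. A minimal sketch, with hypothetical URLs, formats, and API parameters:

```python
# Sketch: a dataset with several "distributions" (serializations), each
# carrying the media type and access URL a client tool needs.
# All URLs and API parameters are hypothetical.

distributions = [
    {"mediaType": "text/csv",
     "accessURL": "https://example.org/data.csv"},
    {"mediaType": "application/json",
     "accessURL": "https://example.org/data.json",
     "api": {"protocol": "REST", "params": ["year", "region"]}},
]

def pick_distribution(dists, preferred):
    """Return the first distribution matching a preference-ordered
    list of media types, or None if nothing matches."""
    for media_type in preferred:
        for d in dists:
            if d["mediaType"] == media_type:
                return d
    return None

chosen = pick_distribution(distributions, ["application/json", "text/csv"])
print(chosen["accessURL"])  # -> https://example.org/data.json
```

Without the `mediaType` field the client would have to download and sniff each URL; with it, format negotiation is one dictionary lookup.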
6. 1. Dataset discovery: interoperable metadata
Secondary catalogs have to be able to retrieve metadata from the dataset catalog
IDEAL
Ideally, secondary catalogs would be able to retrieve only subsets of the
catalog (by type of data, by data format, by phenomenon observed?)
Data service / API with filtering parameters: catalogs as DaaS (Data-as-a-Service)
• All discovery-relevant metadata are exposed in machine-readable form
• Exposed metadata use shared semantics
• Standardization of the values, e.g. for “thematic coverage” or “dimensions” of
datasets, “format” or “protocol used” of distributions etc.
• The value should be standardized, possibly a URI
• The value should be part of an authority list / code list
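The “catalog as a data service” idea above — a secondary catalog retrieving only the subset it needs — reduces to a filtering endpoint over machine-readable records. A sketch, with hypothetical records and field names:

```python
# Sketch: a primary catalog exposes its records through a filtering
# query, so a secondary catalog can harvest only the subset it needs
# (by theme, by format, ...). Records and field names are hypothetical.

catalog = [
    {"title": "Rainfall grid", "theme": "climate", "format": "NetCDF"},
    {"title": "Market prices", "theme": "economics", "format": "CSV"},
    {"title": "Crop yields", "theme": "crops", "format": "CSV"},
]

def query(records, **filters):
    """Return records matching every filter, e.g. query(catalog, format='CSV')."""
    return [r for r in records
            if all(r.get(k) == v for k, v in filters.items())]

print([r["title"] for r in query(catalog, format="CSV")])
# -> ['Market prices', 'Crop yields']
```

Note that the filters only work because the values are standardized: if one record said `"csv"` and another `"Comma-separated"`, the subset retrieval above would silently miss records — which is exactly why the slide insists on shared, ideally URI-based, value vocabularies.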
7. 1. Dataset discovery: ideal architecture
Conclusions
• Dataset metadata ideally created by authors / curator at the local level,
catalog associated with repository
• High-quality metadata in catalogs allowing for answers to all possible queries
• Ownership, rights, temporal, spatial, thematic, data structure, access…
• Machine-readable metadata; agreed vocabularies; shared semantics; APIs for
querying
• General or specialized secondary catalogs federate metadata from primary
catalogs; multiply discoverability and cater for different audiences
• Secondary catalogs also expose good metadata and APIs
• There’s an inventory / registry of dataset repositories and all types of catalogs
IDEAL
8. 2. Dataset discovery: current situation
in Agriculture (1)
• Institutional data repositories are picking up (need for an inventory!)
CURRENT
• Use of standardized or semi-standardized data repository tools with
cataloguing functionalities and APIs is picking up (Dataverse, CKAN…)
• Some governmental metadata catalogs exist, often using standardized tools
(CKAN) and standard vocabularies (DCAT), that include agricultural datasets
• Some international data catalogs exist that include agricultural datasets
(re3data, OpenAIRE, DataHub…)
• Research-oriented data services also exist, like OPeNDAP or Unidata THREDDS
• Some secondary federated catalogs exist (? Need for an inventory!)
• General one for agriculture (usable as an inventory): the CIARD RING
9. 2. Dataset discovery: current situation
Example of CIARD RING secondary catalog
• Architecture:
• Datasets can be hosted anywhere, the RING only hosts the metadata
• Optionally, datasets can be uploaded and the RING can act as a subsidiary repository
• Datasets (metadata) can be federated from other catalogs
• It uses the dataset / distribution DCAT model
• Metadata quality:
• it uses a combination of the DCAT model + the VoID vocabulary and the DataCube
vocabulary + some extra properties (? a “RING DCAT profile” will be published)
• Shared semantics:
• it has a Linked Data layer, URIs for all entities; all categories are published as SKOS
concepts in SKOS concept schemes and are mapped to external concepts whenever
possible
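The shared-semantics layer described above amounts to mapping local category strings to shared concept URIs (SKOS-style), so that queries line up across catalogs. A sketch — all URIs here are hypothetical placeholders, not real AGROVOC or RING concepts:

```python
# Sketch: map local free-text category values to shared concept URIs,
# so records from different catalogs become comparable. The mapping
# table and all URIs are hypothetical placeholders.

concept_map = {
    "maize": "https://example.org/concepts/maize",
    "corn": "https://example.org/concepts/maize",   # synonym -> same concept
    "wheat": "https://example.org/concepts/wheat",
}

def normalize_theme(record):
    """Attach the shared concept URI for a local theme string, if mapped."""
    uri = concept_map.get(record["theme"].lower())
    return {**record, "themeURI": uri} if uri else record

a = normalize_theme({"title": "Corn yields", "theme": "Corn"})
b = normalize_theme({"title": "Maize trials", "theme": "maize"})
print(a["themeURI"] == b["themeURI"])  # -> True: same concept, different labels
```

This is the payoff of publishing categories as SKOS concepts with mappings: two catalogs that label the same topic differently still answer the same federated query.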
10. 2. Dataset discovery: current situation
in Agriculture (2)
• Metadata quality of most used primary data catalog tools is not high
• E.g. no metadata about data structure, no shared semantics for data types, topics,
formats, standards used
CURRENT
>> poor discovery services in secondary catalogs
• Metadata interoperability of most used primary data catalog tools is not high
• No full compliance with broadly recognized vocabularies (DCAT, DataCube…)
• No functionality to apply shared semantics for categorizations like topics, data types,
formats, dimensions (sometimes keywords from AGROVOC)
• Data not always accessible through the exposed metadata
• Scarce population >> lack of reputation / authority of secondary catalogs >>
lack of motivation to share
11. 3. Dataset discovery: infrastructural improvements (1)
Quality depends on the metadata coming from the primary repository /
catalog… how can a good infrastructure overcome this problem?
• Advocacy for better / improved tools?
• Promote improvement of existing tools?
• Dataset repository / catalog platforms in the cloud?
• Complementary / subsidiary role of secondary catalogs?
• Allow subsidiary use of secondary catalogs as primary catalog and even repository
for some datasets (small institutions, individuals)
• Cater for the improvement of metadata directly in the secondary catalogs
• Incentives to provide good metadata?
• E.g. offer mechanisms to a) measure reuse; b) enforce respect of usage rights.
12. 3. Dataset discovery: infrastructural improvements (2)
• Good agreed metadata standards and reference value vocabularies
• Combine existing standards (DCAT, DataCube, VoID…) in an application profile?
• Provide a reference framework of agreed value vocabularies with URIs?
Mapping from local values to agreed ones?
>> AgriSemantics – GACS, VEST/AgroPortal
• Avoid too much interdependence. Design a loosely coupled
infrastructure. (How?)
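An application profile combining DCAT, DataCube and VoID essentially fixes which properties a conforming record must carry. A minimal validation sketch — the required-property list is illustrative, not any published profile:

```python
# Sketch: validate a metadata record against a hypothetical application
# profile drawing properties from DCAT/DCTERMS, the RDF Data Cube
# vocabulary (qb:) and VoID. The required list is illustrative only.

PROFILE_REQUIRED = [
    "dct:identifier", "dct:license",   # general metadata (DCAT / Dublin Core)
    "qb:structure",                    # data structure (RDF Data Cube)
    "void:dataDump",                   # retrievable distribution (VoID)
]

def missing_properties(record, required=PROFILE_REQUIRED):
    """List the required profile properties the record does not carry."""
    return [p for p in required if p not in record]

record = {"dct:identifier": "ds-001",
          "dct:license": "CC-BY-4.0",
          "void:dataDump": "https://example.org/ds-001.nt"}
print(missing_properties(record))  # -> ['qb:structure']
```

A secondary catalog could run such a check at harvest time and report the gaps back to the primary catalog, turning the profile into a concrete quality feedback loop rather than just a document.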
13. Key questions
• Is it our task to aim at having better machine-readable metadata at the level
of the primary local repository / catalog?
• How can we influence this? Advocate for including metadata in researchers’ tools?
• Do we want to “drive” secondary catalogs or let them bloom? Or both?
• At least a global one for food&ag? How many? Who decides? Who manages them?
• How can other infrastructural components facilitate good catalogs?
• Subsidiary metadata in secondary catalogs? Good dataset catalog tools in the cloud?
• Good agreed metadata standards and reference value vocabularies?
• Mapping with local values
• How to design for resilience of the system? Loosely coupled components?
• How much of this is specific to food&ag and which aspects should be tackled
in a broader context? (EOSC?)
14. Data discovery
through dataset repositories
and catalogs
Thank you for your attention
eROSA workshop, Montpellier, 6-7 July 2017
15. Some recommendations from EC High Level
Expert Group on EOSC (1)
“An Internet of data and services where containers with software
applications are routed to relevant data and vice versa” (B. Mons)
- Develop and sustain core data assets for the EOSC and make them
available to the community under well-defined conditions. These may
include workflows, analytics programmes and notably existing datasets
with FAIR status (including metadata creation)
- Support the development of one or more publicly available data search
engine(s) that find FAIR metadata across trusted EOSC repositories
- Develop technologies and approaches to meaningfully measure re-use
and scientific impact of Research Objects after their initial publication
(e.g. metrics that matter and get recognised)
16. Some recommendations from EC High Level
Expert Group on EOSC (2)
- Start dedicated efforts to prepare data and research objects for
inclusion in the EOSC
- Combine single sign-on issues with the connection of social and
professional people oriented web applications resulting in a federated
identity and credentials for all people in the EOSC
- A repository of research vocabularies and a software application to
support wider access, reuse and development of vocabularies thereby
enhancing interoperability