Supporting Dataset Descriptions in the Life Sciences - Alasdair Gray
Machine-processable descriptions of datasets can help make data more FAIR; that is, Findable, Accessible, Interoperable, and Reusable. However, there is a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in focus. Each profile has its own set of properties and requirements as to which are mandatory and which are optional. Developing a dataset description that conforms to a specific metadata profile is therefore a challenging process.
In this talk, I will give an overview of some of the available dataset description specifications. I will discuss the difficulties in writing a dataset description that conforms to a profile, and the tooling that I have developed to support dataset publishers in creating metadata descriptions and validating them against a chosen specification.
Seminar talk given at the EBI on 5 April 2017
An Identifier Scheme for the Digitising Scotland Project - Alasdair Gray
The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records, covering the lives of 18 million people, is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each institution using their own identification scheme. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol to generate a unique identifier for any individual on a certificate, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of the content being misread and resulting in an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Centre - Scotland).
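The paper defines the actual identifier format; as a rough illustration of the idea, an identifier can be composed purely from the position of a record in the registers (district, year, entry number) plus certificate type and person role, never from the handwritten content. The field layout and the single-letter codes below are assumptions for demonstration, not the project's scheme.

```python
# Illustrative sketch only: compose a person identifier from where a record
# sits in the civil registers, not from what is written on it. The layout
# and codes are hypothetical, not the Digitising Scotland scheme.

def make_identifier(district: str, year: int, entry: int,
                    cert_type: str, role: str) -> str:
    """Build an identifier from registration district, year, register entry
    number, certificate type, and the person's role on the certificate.

    cert_type: 'B' (birth), 'M' (marriage), or 'D' (death) -- assumed codes.
    role: e.g. 'C' child, 'F' father, 'M' mother -- also assumed codes.
    """
    if cert_type not in {"B", "M", "D"}:
        raise ValueError(f"unknown certificate type: {cert_type}")
    return f"{district}/{year}/{entry:04d}/{cert_type}/{role}"

print(make_identifier("644-1", 1861, 37, "B", "C"))
# -> 644-1/1861/0037/B/C
```

Because every component is a register position or a fixed code, two people reading the same certificate derive the same identifier even if they disagree on how to read a handwritten name.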
In our series on The Yosemite Project, we explore RDF as a data standard for health data. In this presentation, we will discuss with Claude Nanjo, a Software Architect at Cognitive Medical Systems, ways to expose clinical knowledge as OWL and RDF resources on the Web in order to promote greater convergence in the representation of health knowledge in the longer term. We will also explore how one might rally and coordinate the community to seed the Web with a core set of high-value resources and technologies that could greatly enhance health interoperability.
Tutorial: Describing Datasets with the Health Care and Life Sciences Communit... - Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.
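To make the idea of a profile with mandatory metadata elements concrete, here is a minimal sketch in plain Python: a dataset description as (subject, predicate, object) triples using prefixes from vocabularies the HCLS profile reuses (dct: Dublin Core Terms, pav: versioning), and a check for required properties. The particular required-property subset is an assumption for illustration, not the profile's normative list.

```python
# A toy dataset description as RDF-like triples in plain Python.
# The required-property set below is illustrative, not the HCLS profile itself.

DESCRIPTION = [
    ("ex:chembl", "rdf:type",        "dctypes:Dataset"),
    ("ex:chembl", "dct:title",       "ChEMBL"),
    ("ex:chembl", "dct:description", "Bioactive drug-like molecules."),
    ("ex:chembl", "dct:publisher",   "ex:EMBL-EBI"),
    ("ex:chembl", "dct:license",     "https://creativecommons.org/licenses/by-sa/3.0/"),
    ("ex:chembl", "pav:version",     "22"),
]

REQUIRED = {"dct:title", "dct:description", "dct:publisher", "dct:license"}

def missing_properties(triples, subject):
    """Return the required predicates the subject's description lacks."""
    present = {p for s, p, _ in triples if s == subject}
    return REQUIRED - present

print(missing_properties(DESCRIPTION, "ex:chembl"))  # set() -> complete
```

A real description would be published as RDF (e.g. Turtle) and checked against the profile's actual requirement levels; the point of the sketch is that uniform indexing and querying become possible once every repository guarantees the same core predicates.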
Yosemite Project - Part 3 - Transformations for Integrating VA data with FHIR... - DATAVERSITY
In our series on The Yosemite Project, we explore RDF as a data standard for health data. In this installment, we will hear from Rafael Richards, Physician Informatician, Office of Informatics and Analytics in the Veterans Health Administration (VHA), about “Transformations for Integrating VA data with FHIR in RDF.”
The VistA EHR has its own data model and vocabularies for representing healthcare data. This webinar describes how SPARQL Inferencing Notation (SPIN) can be used to translate VistA data to the representation used by FHIR, an emerging interchange standard.
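SPIN expresses this kind of model-to-model translation as SPARQL rules; the following is a plain-Python analogue of one such mapping rule, shown only to illustrate the shape of the problem. The VistA field names and the FHIR Patient fields used here are simplified assumptions, not either system's actual schema.

```python
# Plain-Python analogue of a single VistA -> FHIR mapping rule.
# Field names on both sides are simplified assumptions for illustration.

def vista_to_fhir_patient(vista_record: dict) -> dict:
    """Translate a toy VistA-style patient record into a FHIR-style
    Patient resource (name, gender, birthDate only)."""
    # VistA-style names are conventionally "FAMILY,GIVEN".
    family, _, given = vista_record["NAME"].partition(",")
    return {
        "resourceType": "Patient",
        "name": [{"family": family.strip(), "given": [given.strip()]}],
        "gender": {"M": "male", "F": "female"}.get(vista_record.get("SEX"), "unknown"),
        "birthDate": vista_record.get("DOB"),
    }

patient = vista_to_fhir_patient({"NAME": "SMITH,JOHN", "SEX": "M", "DOB": "1950-04-01"})
print(patient["name"][0]["family"])  # SMITH
```

In the SPIN approach the same logic would live as a declarative SPARQL CONSTRUCT rule over the RDF form of the VistA data, so the mapping itself is data that can be inspected, versioned, and shared.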
Our speaker, Joshua Mandel, will provide a lightning tour of Fast Healthcare Interoperability Resources (FHIR), an emerging clinical data standard, with a focus on its resource-oriented approach, and a discussion of how FHIR intersects with the Semantic Web. We'll look at how FHIR represents links between entities; how FHIR represents concepts from standards-based vocabularies; and how a set of FHIR instance data can be represented in RDF.
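FHIR instance data is exchanged as nested JSON (or XML) resources, and representing it in RDF essentially means flattening that nesting into triples. The sketch below shows the idea with a generic walker over a FHIR-style JSON resource; FHIR's official RDF (Turtle) representation has its own naming and typing rules, so this is an illustration of the principle, not that specification.

```python
# Flatten a FHIR-style nested JSON resource into RDF-like triples.
# This illustrates the idea only; it is not the official FHIR RDF mapping.

def to_triples(resource: dict, subject=None):
    """Yield (subject, predicate, object) tuples from a nested JSON resource,
    minting blank-node-style identifiers for nested structures."""
    subject = subject or f"_:{resource.get('resourceType', 'node')}"
    for key, value in resource.items():
        if isinstance(value, dict):
            node = f"{subject}/{key}"
            yield (subject, f"fhir:{key}", node)
            yield from to_triples(value, node)
        elif isinstance(value, list):
            for i, item in enumerate(value):
                node = f"{subject}/{key}/{i}"
                yield (subject, f"fhir:{key}", node)
                if isinstance(item, dict):
                    yield from to_triples(item, node)
                else:
                    yield (node, "rdf:value", item)
        else:
            yield (subject, f"fhir:{key}", value)

obs = {"resourceType": "Observation", "status": "final",
       "code": {"text": "Blood pressure"}}
triples = list(to_triples(obs))
```

Once in triple form, FHIR's links between entities and its references to standards-based vocabularies become ordinary RDF edges that SPARQL can traverse alongside any other linked data.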
The HCLS Community Profile: Describing Datasets, Versions, and Distributions - Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets.
The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.
Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)
Validata: A tool for testing profile conformance - Alasdair Gray
Validata (http://hw-swel.github.io/Validata/) is an online web application for validating a dataset description expressed in RDF against a community profile expressed as a Shape Expression (ShEx). Additionally, it provides an API for programmatic access to the validator. Validata can be used with multiple community-agreed standards, e.g. DCAT, the HCLS community profile, or the Open PHACTS guidelines, and there are currently deployments to support each of these. Validata can easily be repurposed for a different deployment by providing it with a new ShEx schema. The Validata code is available from GitHub (https://github.com/HW-SWeL/Validata).
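To convey the flavour of shape-based validation, here is a toy checker in plain Python: a "shape" as required predicates with min/max cardinalities, applied to a set of triples. This is not ShEx (which is a full schema language with its own syntax and semantics), and the shape shown is invented for illustration.

```python
# Toy shape conformance check: required predicates with cardinalities.
# Illustrative only -- real ShEx validation (as in Validata) is far richer.

SHAPE = {                          # predicate: (min, max) occurrences
    "dct:title":         (1, 1),
    "dct:description":   (1, 1),
    "dcat:distribution": (1, None),   # one or more
}

def conforms(triples, subject, shape=SHAPE):
    """Return a list of violation messages; an empty list means conformant."""
    errors = []
    for predicate, (lo, hi) in shape.items():
        n = sum(1 for s, p, _ in triples if s == subject and p == predicate)
        if n < lo or (hi is not None and n > hi):
            errors.append(f"{predicate}: found {n}, expected {lo}..{hi or '*'}")
    return errors

data = [("ex:ds", "dct:title", "My dataset"),
        ("ex:ds", "dct:description", "An example."),
        ("ex:ds", "dcat:distribution", "ex:ds.ttl")]
print(conforms(data, "ex:ds"))  # [] -> conformant
```

Swapping in a different shape dictionary changes what is enforced without touching the checking code, which mirrors how Validata is repurposed for a new community profile by supplying a new ShEx schema.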
Presentation given at SDSVoc https://www.w3.org/2016/11/sdsvoc
2013 DataCite Summer Meeting - Making Research Better
DataCite. Co-sponsored by CODATA.
Thursday, 19 September 2013 at 13:00 - Friday, 20 September 2013 at 12:30
Washington, DC. National Academy of Sciences
http://datacite.eventbrite.co.uk/
Linked Open Data Libraries Archives Museums. This presentation is a basic overview of what LOD is and what technologies are needed to ensure the metadata around your collections is machine readable.
Scientists’ Hard Drives, Databases, and Blogs: Preservation Intent and Source... - Trevor Owens
Carl Sagan’s WordPerfect files, simulations emailed to Edward Lorenz, a database application from the National Library of Medicine, a collection of science blogs, a database of interstellar distances; each of these digital artifacts has been acquired by archives and special collections. Born-digital primary sources are no longer a future concern for archivists, librarians, curators, and historians. As historians of science turn their attention to the late 20th and early 21st century, they will need to work from these born-digital primary sources. We have already accumulated a significant born-digital past, and it’s time for work with born-digital primary sources to become mainstream. This presentation will give a quick tour of individual born-digital artifacts toward two goals. First, I argue for the need for archivists, curators, and librarians to reflexively develop approaches to establishing preservation intent for digital content, grounded in a dialog with the nature of a given set of digital objects and its future research use. Second, for historians, I suggest how trends in computational analysis of information in the digital humanities should be combined with approaches from digital forensics and new media studies to establish historiographic practices for born-digital source criticism. I conclude by suggesting the kinds of technical skills that archivists, librarians, curators, and historians working with these materials are going to need to develop. Just as historians working with premodern documents require language and paleography skills, historians working with digital artifacts will increasingly need to understand the inscription processes of hard drives, the provenance created by web crawlers, and how to read relational databases of varying vintages.
Consuming Linked Data by Humans - WWW2010 - Juan Sequeda
These are the Consuming Linked Data by Humans slides that we presented at the Consuming Linked Data tutorial at WWW2010 in Raleigh, NC on April 26, 2010
Google's recent announcement that it will support the use of microformats in its search results opens up new possibilities for librarians and library technologists to support the goals of the semantic web; namely, to provide better access, reuse, and recombination of library resources and services on the open web. This lightning talk introduces the semantic web and semantic markup technologies.
Presentation delivered in the context of the Agricultural Data Interoperability WG meeting, during the RDA 3rd Plenary Meeting in Dublin, Ireland, 26/3/2014.
The presentation is mostly focused on the work done by the agINFRA project towards proposing a methodology for the definition of Germplasm descriptors as RDF, based on the existing work of experts in the field and making use of the existing effort in this direction.
This tutorial explains the Data Web vision, some preliminary standards and technologies as well as some tools and technological building blocks developed by AKSW research group from Universität Leipzig.
Linking Open, Big Data Using Semantic Web Technologies - An Introduction - Ronald Ashri
The Physics Department of the University of Cagliari and the Linkalab Group invited me to talk about the Semantic Web and Linked Data - this is simply an introduction to the technologies involved.
The CIARD RING, a global directory of datasets for agriculture, by Valeria P... - CIARD Movement
Presentation delivered at the Agricultural Data Interoperability Interest Group -- Research Data Alliance (RDA) 4th Plenary Meeting -- Amsterdam, September 2014
Johannes Keizer presented the outcomes of the eROSA project with researchers from the Agricultural Information Institute of CAAS (Chinese Academy of Agricultural Science)
3. Availability is not enough! Complex information needs for agricultural development through research and innovation cannot be met by simply making information available. [Slide diagram: disconnected HTML repositories - TECA best practices, country profiles, CARIS, WISARD, AGRIS, country NARS, ICARDA, AiDA, a crop database, and OPACs - each serving users separately.] We need to know whether a certain technology has been used in a specific country and in a dryland area for a specific crop, whether there are related projects currently ongoing, and where we can find the project outputs.
23. CIARD RING: value-added services. [Slide diagram: the same repositories - CARIS/WISARD, TECA best practices, country profiles, AGRIS, country NARS, ICARDA, AiDA, the crop database, and OPACs - now connected to users through gateways and shared resources: a geo-ontology, a crop ontology, an organizations directory, Agrovoc, and an OA gateway.] The same question, whether a certain technology has been used in a specific country and in a dryland area for a specific crop, with related ongoing projects and findable outputs, can now be answered.