Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high-quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine-readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.
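To make the profile concrete, here is a minimal sketch in Python with rdflib of a summary-level description using the vocabularies the profile draws on; the dataset IRI and all values are hypothetical, and a conformant description would carry more elements.

    # A minimal, illustrative HCLS-style dataset description built with
    # rdflib. The IRI and literal values are hypothetical.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCAT, DCTERMS, RDF

    g = Graph()
    ds = URIRef("http://example.org/dataset/chem")  # hypothetical IRI

    g.add((ds, RDF.type, DCAT.Dataset))
    g.add((ds, DCTERMS.title, Literal("Example Chemistry Dataset", lang="en")))
    g.add((ds, DCTERMS.description, Literal("An illustrative description.", lang="en")))
    g.add((ds, DCTERMS.publisher, URIRef("http://example.org/org/lab")))
    g.add((ds, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))

    print(g.serialize(format="turtle"))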
Supporting Dataset Descriptions in the Life Sciences – Alasdair Gray
Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and requirements as to which must be provided and which are more optional. Developing a dataset description for a given dataset to conform to a specific metadata profile is a challenging process.
In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile and the tooling that I've developed to support dataset publishers in creating metadata descriptions and validating them against a chosen specification.
Seminar talk given at the EBI on 5 April 2017
The HCLS Community Profile: Describing Datasets, Versions, and Distributions – Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high-quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine-readable descriptions of versioned datasets.
The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.
Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)
An Identifier Scheme for the Digitising Scotland Project – Alasdair Gray
The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification scheme. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol to generate a unique identifier for any individual on a certificate, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of the content being misread resulting in an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme, and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Center - Scotland).
Validata: A tool for testing profile conformance – Alasdair Gray
Validata (http://hw-swel.github.io/Validata/) is an online web application for validating a dataset description expressed in RDF against a community profile expressed as a Shape Expression (ShEx). Additionally, it provides an API for programmatic access to the validator. Validata can be used with multiple community-agreed standards, e.g. DCAT, the HCLS community profile, or the Open PHACTS guidelines, and there are currently deployments to support each of these. Validata can be easily repurposed for different deployments by providing it with a new ShEx schema (a sketch of the idea follows this entry). The Validata code is available from GitHub (https://github.com/HW-SWeL/Validata).
Presentation given at SDSVoc https://www.w3.org/2016/11/sdsvoc
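To illustrate the kind of check Validata performs, the sketch below validates a tiny description against a ShEx shape in Python using the third-party pyshex package. Treat it as an assumption-laden illustration: Validata itself is a web application, and the shape, data, and IRIs here are invented for the example.

    # Illustrative ShEx conformance check (assumes the pyshex package and
    # its ShExEvaluator API; Validata is a web app, this mirrors the idea).
    from pyshex import ShExEvaluator

    SHEX = """
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    <http://example.org/shapes/Dataset> {
      dct:title xsd:string ;   # exactly one title
      dct:license IRI          # exactly one licence IRI
    }
    """

    DATA = """
    @prefix dct: <http://purl.org/dc/terms/> .
    <http://example.org/dataset/chem>
        dct:title "Example Chemistry Dataset" ;
        dct:license <https://creativecommons.org/licenses/by/4.0/> .
    """

    for r in ShExEvaluator(rdf=DATA, schema=SHEX,
                           focus="http://example.org/dataset/chem",
                           start="http://example.org/shapes/Dataset").evaluate():
        print(r.focus, "conforms" if r.result else "does not conform")

Swapping in a different ShEx schema is all that is needed to target another profile, which is the same repurposing step Validata supports.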
Our speaker, Joshua Mandel, will provide a lightning tour of Fast Healthcare Interoperability Resources (FHIR), an emerging clinical data standard, with a focus on its resource-oriented approach, and a discussion of how FHIR intersects with the Semantic Web. We'll look at how FHIR represents links between entities; how FHIR represents concepts from standards-based vocabularies; and how a set of FHIR instance data can be represented in RDF.
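For a flavour of what FHIR instance data looks like in RDF, here is a minimal Turtle fragment parsed with rdflib in Python. The patient IRI is hypothetical and the property style follows the FHIR RDF draft, so exact names may differ between FHIR releases; treat it as a sketch rather than canonical FHIR RDF.

    # A sketch of FHIR instance data in RDF (Turtle), parsed with rdflib.
    from rdflib import Graph

    TTL = """
    @prefix fhir: <http://hl7.org/fhir/> .

    <http://example.org/Patient/pat1>
        a fhir:Patient ;
        fhir:nodeRole fhir:treeRoot ;
        fhir:Patient.gender [ fhir:value "female" ] .
    """

    g = Graph().parse(data=TTL, format="turtle")
    print(g.serialize(format="turtle"))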
In our series on The Yosemite Project, we explore RDF as a data standard for health data. In this presentation, we will discuss with Claude Nanjo, a Software Architect at Cognitive Medical Systems, ways to expose clinical knowledge as OWL and RDF resources on the Web in order to promote greater convergence in the representation of health knowledge in the longer term. We will also explore how one might rally and coordinate the community to seed the Web with a core set of high-value resources and technologies that could greatly enhance health interoperability.
Yosemite Project - Part 3 - Transformations for Integrating VA data with FHIR... – DATAVERSITY
In our series on The Yosemite Project, we explore RDF as a data standard for health data. In this installment, we will hear from Rafael Richards, Physician Informatician, Office of Informatics and Analytics in the Veterans Health Administration (VHA), about “Transformations for Integrating VA data with FHIR in RDF.”
The VistA EHR has its own data model and vocabularies for representing healthcare data. This webinar describes how SPARQL Inferencing Notation (SPIN) can be used to translate VistA data to the representation used by FHIR, an emerging interchange standard.
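SPIN represents rules as SPARQL under the hood, so the flavour of such a translation can be sketched as a plain SPARQL CONSTRUCT executed with rdflib in Python. The vista: terms below are hypothetical placeholders and the fhir: property mirrors the FHIR RDF draft style; none of this is the webinar's actual mapping.

    # Sketch of a VistA-to-FHIR style translation as a SPARQL CONSTRUCT
    # (SPIN rules are SPARQL templates underneath). Vocabulary terms are
    # illustrative placeholders only.
    from rdflib import Graph

    src = Graph().parse(data="""
    @prefix vista: <http://example.org/vista#> .
    <http://example.org/pt/42> vista:sex "F" .
    """, format="turtle")

    result = src.query("""
    PREFIX vista: <http://example.org/vista#>
    PREFIX fhir:  <http://hl7.org/fhir/>
    CONSTRUCT { ?p a fhir:Patient ;
                   fhir:Patient.gender [ fhir:value "female" ] }
    WHERE     { ?p vista:sex "F" }
    """)

    for triple in result:
        print(triple)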
WEBINAR: The Yosemite Project: An RDF Roadmap for Healthcare Information Inte... – DATAVERSITY
Interoperability of electronic healthcare information remains an enormous challenge in spite of 100+ available healthcare information standards. This webinar explains the Yosemite Project, whose mission is to achieve semantic interoperability of all structured healthcare information through RDF as a common semantic foundation. It explains the rationale and technical strategy of the Yosemite Project, and describes how RDF and related standards address a two-pronged strategy for semantic interoperability: facilitating collaborative standards convergence whenever possible, and crowd-sourced data translations when necessary.
The DataTags System: Sharing Sensitive Data with Confidence – Merce Crosas
This talk was part of a session at the Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016:
https://rd-alliance.org/rda-8th-plenary-joint-meeting-ig-domain-repositories-wg-rdaniso-privacy-implications-research-data
Slides in Merce Crosas site:
http://scholar.harvard.edu/mercecrosas/presentations/datatags-system-sharing-sensitive-data-confidence
Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing p... – GigaScience, BGI Hong Kong
Jesse Xiao at the Data Publishing session at CODATA2017: Updates to the GigaDB open access data publishing platform. Wednesday 11th October in St Petersburg, Russia
Metadata as Linked Data for Research Data Repositories – andrea huang
“Every man has his own cosmology and who can say that his own is right,” as Einstein put it. The same holds when we come to data semantics: one dataset may be interpreted differently by different data creators, curators, and re-users. How, then, do we build a better research data repository?
We start from the point made by Willis, C., Greenberg, J., & White, H. (2012) that metadata for research data increases access to and reuse of the data. Stanford, Harvard, and Cornell believe the use of linked data technologies is a promising method for gathering contextual information about research resources.
Looking for tools that can meet the urgent need for innovative solutions that provide feature-rich services for data publishing, such as visualization, validation, and reuse in different applications by research repositories (Assante et al., 2016), we chose CKAN (Comprehensive Knowledge Archive Network) as our starting point: a major solution that makes linked metadata available, citable, and validated.
Original file: http://m.odw.tw/u/odw/m/metadata-as-linked-data-for-research-data-repositories/
Wikidata: Verifiable, Linked Open Knowledge That Anyone Can Edit – Dario Taraborelli
Slides for my September 23 talk on Wikidata and WikiCite – NIH Frontiers in Data Science lecture series.
Persistent URL: https://dx.doi.org/10.6084/m9.figshare.3850821
Linked Data for improved organization of research data – Samuel Lampa
Slides for a talk at a Farmbio BioScience Seminar, May 18, 2018 (http://farmbio.uu.se), introducing Linked Data as a way to manage research data that better keeps track of provenance, makes its semantics more explicit, and makes it more easily integrated with other data and consumed by others, both humans and machines.
Sharing data with lightweight data standards, such as schema.org and bioschemas. The Knetminer case, an application for the agrifood domain and molecular biology.
Presented at Open Data Sicilia (#ODS2021)
HRGRN: enabling graph search and integrative analysis of Arabidopsis signalin... – Araport
The biological networks controlling plant signal transduction, metabolism and gene regulation are composed not only of genes, RNAs, proteins and compounds but also of the complicated interactions among them. Yet, even in the most thoroughly studied model plant, Arabidopsis thaliana, the knowledge regarding these interactions is scattered throughout the literature and various public databases. Thus, new scientific discovery by exploring these complex and heterogeneous data remains a challenging task for biologists.
We developed a graph-search empowered platform named HRGRN to search known and, more importantly, discover novel relationships among genes in Arabidopsis biological networks. HRGRN includes over 51,000 “nodes” that represent very large sets of genes, proteins, small RNAs, and compounds, and approximately 150,000 “edges” classified into nine types of interactions (interactions between proteins, compounds and proteins, transcription factors (TFs) and their downstream target genes, small RNAs and their target genes, kinases and downstream target genes, transporters and substrates, substrate/product compounds and enzymes, as well as gene pairs with similar expression patterns to provide deep insight into gene-gene relationships) to comprehensively model and represent the complex interactions between nodes.
HRGRN allows users to discover novel interactions between genes and/or pathways, and to build sub-networks from user-specified seed nodes by searching the comprehensive collections of interactions stored in its back-end graph databases using graph traversal algorithms. The HRGRN database is freely available at http://plantgrn.noble.org/hrgrn/. Currently, we are collaborating with the Araport team to develop REST-like web services and provide HRGRN's graph search functions to the Araport system.
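The seed-node traversal idea is generic; the Python sketch below grows a sub-network from seed nodes over a toy, hypothetical interaction list. It illustrates the technique only, not HRGRN's actual back-end or algorithms.

    # Generic illustration: build a sub-network from seed nodes by
    # breadth-first traversal over a toy typed-interaction graph.
    from collections import deque

    EDGES = {  # hypothetical typed interactions
        "TF1":   [("geneA", "TF-target"), ("geneB", "TF-target")],
        "geneA": [("miR1", "smallRNA-target")],
        "geneB": [],
        "miR1":  [],
    }

    def subnetwork(seeds, max_depth=2):
        seen = set(seeds)
        found = []
        queue = deque((s, 0) for s in seeds)
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue
            for neighbour, kind in EDGES.get(node, []):
                found.append((node, kind, neighbour))
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
        return found

    print(subnetwork(["TF1"]))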
A single interface for accessing life sciences (LS) data is a natural step towards mastering the data deluge in this domain. Data in the LS require integration, and current integrative solutions increasingly rely on the federation of queries over distributed resources. We introduce a federated query processing system named “BioFed”, customised for the LS Linked Open Data (LS-LOD) cloud. BioFed federates SPARQL queries over more than 130 public SPARQL endpoints.
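Federation of this kind builds on the SPARQL 1.1 SERVICE clause; the Python sketch below (using SPARQLWrapper) shows the mechanism. The endpoints are real public services but the query is illustrative, support for SERVICE varies by endpoint, and BioFed's own source selection and optimisation are far more sophisticated.

    # Illustrative federated SPARQL query: one endpoint executes the
    # query, a SERVICE clause pulls matching data from another.
    from SPARQLWrapper import SPARQLWrapper, JSON

    QUERY = """
    SELECT ?s WHERE {
      SERVICE <https://sparql.uniprot.org/sparql> {
        ?s a <http://purl.uniprot.org/core/Protein> .
      }
    } LIMIT 5
    """

    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
    endpoint.setQuery(QUERY)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["s"]["value"])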
Open Source Tools Facilitating Sharing/Protecting Privacy: Dataverse and Data... – Merce Crosas
Presentation for the NFAIS Webinar series: Open Data Fostering Open Science: Meeting Researchers' Needs
http://www.nfais.org/index.php?option=com_mc&view=mc&mcid=72&eventId=508850&orgId=nfais
This webinar explains the service, covers what publishers need to participate and answers any questions you may have. This webinar was held on November 10, 2015.
Dataset Descriptions in Open PHACTS and HCLS – Alasdair Gray
This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The creation of the specification was driven by a real need within the project to track the datasets used.
Details of the dataset metadata captured and the vocabularies used to model this metadata are given together with the tools developed to enable the specification's uptake.
Over the course of the last 12 months, the W3C Healthcare and Life Science Interest Group have been developing a community profile for dataset descriptions. This has drawn on the ideas developed in the Open PHACTS specification. A brief overview of the forthcoming community profile is given in the presentation.
This presentation was given to the Network Data Exchange project http://www.ndexbio.org/ on 2 April 2014.
Re-using Media on the Web: Media fragment re-mixing and playout – MediaMixerCommunity
A number of novel application ideas will be introduced based on the media fragment creation, specification and rights management technologies. Semantic search and retrieval allows us to organize sets of fragments by topical or conceptual relevance. These fragment sets can then be played out in a non-linear fashion to create a new media re-mix. We look at a server-client implementation supporting Media Fragments, before allowing the participants to take the sets of media they have selected and create their own re-mix.
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud – Ontotext
This webinar will break the roadblocks that prevent many from reaping the benefits of heavyweight Semantic Technology in small scale projects. We will show you how to build Semantic Search & Analytics proof of concepts by using managed services in the Cloud.
These are the slides I used for a talk about a Semantic Web use case. Not everyone knows what exactly the Semantic Web is about, so I created a set of slides explaining it in a simple and correct way. The use-case slides have been removed from this publicly available version. An animated version is here: goo.gl/qKoF6k. Contact me for sources!
Introduction to Apache Any23. Any23 is a library, a web service, and a command line tool, written in Java, that extracts structured RDF data from a variety of Web documents and markup formats.
Any23 is an Apache Software Foundation top level project.
20th Feb 2020 json-ld-rdf-im-proposal.pdf – Michal Miklas
originally published as part of the Egeria project:
https://github.com/odpi/data-governance/blob/master/work-groups/semantic/documents/20th%20Feb%202020%20json-ld-rdf-im-proposal.pptx
used in this article:
https://www.linkedin.com/pulse/rdfowl-best-enough-any-model-michal-miklas
Flexible metadata schemes for research data repositories - CLARIN Conference'21 – vty
The development of the Common Framework in Dataverse and the CMDI use case. Building an AI/ML-based workflow for predicting and linking concepts from external controlled vocabularies to CMDI metadata values.
Flexible metadata schemes for research data repositories - Clarin Conference... – Vyacheslav Tykhonov
The development of the Common Framework in Dataverse and the CMDI use case. Building an AI/ML-based workflow for predicting and linking concepts from external controlled vocabularies to CMDI metadata values.
Introduction to metadata cleansing using SPARQL update queriesEuropean Commission
What is it about?
-How to transform your metadata using simple SPARQL Update queries (see the sketch after this list);
-How to conform to the ADMS-AP to get your interoperability solutions ready to be shared on Joinup;
-The main types of errors that you could face when uploading metadata of interoperability solutions on Joinup.
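As a small illustration of such a cleansing step, the Python sketch below runs a SPARQL Update DELETE/INSERT with rdflib. The renamed property is a hypothetical example, not one of Joinup's actual ADMS-AP fixes.

    # Metadata cleansing via SPARQL Update: rename a non-standard title
    # property to dct:title. ex:label is a hypothetical cleansing target.
    from rdflib import Graph

    g = Graph().parse(data="""
    @prefix ex: <http://example.org/ns#> .
    <http://example.org/solution/1> ex:label "An interoperability solution" .
    """, format="turtle")

    g.update("""
    PREFIX ex:  <http://example.org/ns#>
    PREFIX dct: <http://purl.org/dc/terms/>
    DELETE { ?s ex:label ?v }
    INSERT { ?s dct:title ?v }
    WHERE  { ?s ex:label ?v }
    """)

    print(g.serialize(format="turtle"))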
This presentation addresses the main issues of Linked Data and scalability. In particular, it gives details on approaches and technologies for clustering, distributing, sharing, and caching data. Furthermore, it addresses the means for publishing data through cloud deployment and the relationship between Big Data and Linked Data, exploring how some of the solutions can be transferred to the context of Linked Data.
HDL - Towards A Harmonized Dataset Model for Open Data Portals – Ahmad Assaf
The Open Data movement triggered an unprecedented amount of data published in a wide range of domains. Governments and corporations around the world are encouraged to publish, share, use and integrate Open Data. There are many areas where one can see the added value of Open Data, from transparency and self-empowerment to improving efficiency, effectiveness and decision making. This growing amount of data requires rich metadata in order to reach its full potential. This metadata enables dataset discovery, understanding, integration and maintenance. Data portals, which are considered to be datasets' access points, offer metadata represented in different and heterogeneous models. In this paper, we first conduct a unique and comprehensive survey of seven metadata models: CKAN, DKAN, Public Open Data, Socrata, VoID, DCAT and Schema.org. Next, we propose a Harmonized Dataset modeL (HDL) based on this survey. We describe use cases that show the benefits of providing rich metadata to enable dataset discovery, search and spam detection.
How to describe a dataset. Interoperability issues – Valeria Pesce
Presented by Valeria Pesce during the pre-meeting of the Agricultural Data Interoperability Interest Group (IGAD) of the Research Data Alliance (RDA), held on 21 and 22 September 2015 in Paris at INRA.
Doctoral Examination at the Karlsruhe Institute of Technology (08.07.2016) – Dr.-Ing. Thomas Hartmann
In this thesis, a validation framework is introduced that enables the consistent execution of RDF-based constraint languages on RDF data and the formulation of constraints of any type. The framework reduces the representation of constraints to the absolute minimum, is based on formal logics, consists of a small lightweight vocabulary, ensures consistency of validation results, and enables constraint transformations for each constraint type across RDF-based constraint languages.
Similar to Tutorial: Describing Datasets with the Health Care and Life Sciences Community Profile
Using a Jupyter Notebook to perform a reproducible scientific analysis over s... – Alasdair Gray
In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practice to overcome these issues.
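One practice that follows from this is recording, inside the notebook itself, which dataset versions answered a query and when. A Python sketch of the idea follows; the endpoint URL is hypothetical and the PAV property queried is illustrative, since endpoints expose different metadata.

    # Record the provenance of a live SPARQL endpoint before analysing:
    # fetch dataset version metadata and timestamp the run.
    from datetime import datetime, timezone
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://sparql.example.org/sparql")  # hypothetical
    endpoint.setQuery("""
    PREFIX pav: <http://purl.org/pav/>
    SELECT ?dataset ?version WHERE { ?dataset pav:version ?version } LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)

    run_at = datetime.now(timezone.utc).isoformat()
    versions = [(b["dataset"]["value"], b["version"]["value"])
                for b in endpoint.query().convert()["results"]["bindings"]]
    print("Executed at", run_at, "against", versions)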
Bioschemas Community: Developing profiles over Schema.org to make life scienc... – Alasdair Gray
The Bioschemas community (http://bioschemas.org) is a loose collaboration formed by a wide range of life science resource providers and informaticians. The community is developing profiles over Schema.org to enable life science resources such as data about a specific protein, sample, or training event, to be more discoverable on the web. While the content of well-known resources such as Uniprot (for protein data) are easily discoverable, there is a long tail of specialist resources that would benefit from embedding Schema.org markup in a standardised approach.
The community have developed twelve profiles for specific types of life science resources (http://bioschemas.org/specifications/), with another six at an early draft stage. For each profile, a set of use cases have been identified. These typically focus on search, but several facilitate lightweight data exchange to support data aggregators such as Identifiers.org, FAIRsharing.org, and BioSamples. The next stage of the development of a profile consists of mapping the terms used in the use cases to existing properties in Schema.org and domain ontologies. The properties are then prioritised in order to support the use cases, with a minimal set of about six properties identified, along with a larger set of recommended and optional properties. For each property, an expected cardinality is defined and where appropriate, object values are specified from controlled vocabularies. Before a profile is finalised, it must first be demonstrated that resources can deploy the markup.
In this talk, we will outline the progress that has been made by the Bioschemas Community in a single year through three hackathon events. We will discuss the processes followed by the Bioschemas Community to foster collaboration, and highlight the benefits and drawbacks of using open Google documents and spreadsheets to support the community develop the profiles. We will conclude by summarising future opportunities and directions for the community.
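To show the shape of such markup, here is a minimal Schema.org JSON-LD snippet parsed with rdflib in Python. The type and properties are common Schema.org terms chosen for illustration, not an official Bioschemas profile listing, and the record IRI is hypothetical.

    # Minimal, illustrative Schema.org JSON-LD of the kind Bioschemas
    # profiles constrain. Requires rdflib >= 6, which bundles json-ld.
    import json
    from rdflib import Graph

    doc = json.dumps({
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "@id": "http://example.org/record/P12345",
        "name": "Example protein record",
        "description": "Illustrative markup for web discoverability.",
        "url": "http://example.org/record/P12345",
    })

    g = Graph().parse(data=doc, format="json-ld")
    print(len(g), "triples extracted")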
Presentation given at the Open PHACTS project symposium.
The slides give an overview of the data in version 2.0 of the Open PHACTS drug discovery platform and the challenges that have been faced in the Open PHACTS project to reach this stage.
This presentation was prepared for my faculty Christmas conference.
Abstract: For the last 11 months I have been working on a top secret project with a world renowned Scandinavian industry partner. We are now moving into the exciting operational phase of this project. I have been granted an early lifting of the embargo that has stopped me talking about this work up until now. I will talk about the data science behind this big data project and how semantic web technology has enabled the delivery of Project X.
Data Integration in a Big Data Context: An Open PHACTS Case Study – Alasdair Gray
Keynote presentation at the EU Ambient Assisted Living Forum workshop The Crusade for Big Data in the AAL Domain.
The presentation explores the Open PHACTS project and how it overcame various Big Data challenges.
Data is being generated all around us – from our smart phones tracking our movement through a city to the city itself sensing various properties and reacting to various conditions. However, to maximise the potential from all this data, it needs to be combined and coerced into models that enable analysis and interpretation. In this talk I will give an overview of the techniques that I have developed for data integration: integrating streams of sensor data with background contextual data and supporting multiple interpretations of linking data together. At the end of the talk I will overview the work I will be conducting in the Administrative Data Research Centre for Scotland.
Many areas of scientific discovery rely on combining data from multiple data sources. However, there are many challenges in linking data. This presentation highlights these challenges in the context of using Linked Data for environmental and social science databases.
Scientific lenses to support multiple views over linked chemistry data – Alasdair Gray
When are two entries about a small molecule in different datasets the same? If they have the same drug name, chemical structure, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this paper, we present an approach to enable applications to choose the equivalence criteria to apply between datasets. Thus, supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Scientific Lenses over Linked Data: An approach to support multiple integrate... – Alasdair Gray
When are two entries about a concept in different datasets the same? If they have the same name, properties, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this presentation, I will introduce Scientific lenses, an approach that enables applications to vary the equivalence conditions between linked datasets. They have been deployed in the Open PHACTS Discovery Platform – a large scale data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Describing Scientific Datasets: The HCLS Community Profile – Alasdair Gray
Big Data presents an exciting opportunity to pursue large-scale analyses over collections of data in order to uncover valuable insights across a myriad of fields and disciplines. Yet, as more and more data is made available, researchers are finding it increasingly difficult to discover and reuse these data. One problem is that data are insufficiently described to understand what they are or how they were produced. A second issue is that no single vocabulary provides all key metadata fields required to support basic scientific use cases. A third issue is that data catalogs and data repositories all use different metadata standards, if they use any standard at all, and this prevents easy search and aggregation of data. Therefore, we need a community profile to indicate what are the essential metadata, and the manner in which we can express it.
The W3C Health Care and Life Sciences Interest Group have developed such a community profile that defines the required properties to provide high-quality dataset descriptions that support finding, understanding, and reusing scientific data, i.e. making the data FAIR (Findable, Accessible, Interoperable and Re-usable – http://datafairport.org). The specification reuses many notions and vocabulary terms from Dublin Core, DCAT and VoID, with provenance and versioning information being provided by PROV-O and PAV. The community profile is based around a three tier model; the summary description captures catalogue style metadata about the dataset, each version of the dataset is described separately as are the various distribution formats of these versions. The resulting community profile is generic and applicable to a wide variety of scientific data.
Tools are being developed to help with the creation and validation of these descriptions. Several datasets including those from Bio2RDF, EBI and IntegBio are already moving to release descriptions conforming to the community profile.
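A compact illustration of that three-tier model in Turtle, parsed with rdflib in Python; the IRIs and values are hypothetical, and a real conformant description would carry many more properties.

    # The HCLS three-tier model: summary, version, distribution.
    from rdflib import Graph

    TTL = """
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix pav:  <http://purl.org/pav/> .

    <http://example.org/chembl>                    # summary level
        dct:title "Example ChEMBL-like dataset" .

    <http://example.org/chembl/20>                 # version level
        dct:isVersionOf <http://example.org/chembl> ;
        pav:version "20" ;
        dcat:distribution <http://example.org/chembl/20.ttl> .

    <http://example.org/chembl/20.ttl>             # distribution level
        dcat:downloadURL <http://example.org/data/chembl20.ttl> ;
        dct:format "text/turtle" .
    """

    g = Graph().parse(data=TTL, format="turtle")
    print(len(g), "triples")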
SensorBench is a benchmark suite for wireless sensor networks. The design of wireless sensor network systems sits within a multi-dimensional design space, where it can be difficult to understand the implications of specific decisions and to identify optimal solutions. SensorBench enables the systematic analysis and comparison of different techniques and platforms, enabling both development and user communities to make well informed choices. The benchmark identifies key variables and performance metrics, and specifies experiments that explore how different types of task perform under different metrics for the controlled variables. The benchmark is demonstrated by its application on representative platforms.
Full details of the benchmark are available from http://dl.acm.org/citation.cfm?id=2618252 (DOI: 10.1145/2618243.2618252)
What are the research and technical challenges of linked data that are relevant to data science?
This presentation introduces the ideas of linked data using the BBC sport web site as an example. It then identifies several research challenges that remain to be addressed.
Computing Identity Co-Reference Across Drug Discovery Datasets – Alasdair Gray
This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task- and domain-specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied for the context of their application. We highlight the challenges of automatically computing co-reference and the need for capturing the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
Incorporating Commercial and Private Data into an Open Linked Data Platform f... – Alasdair Gray
The Open PHACTS Discovery Platform aims to provide an integrated information space to advance pharmacological research in the area of drug discovery. Effective drug discovery requires comprehensive data coverage, i.e. integrating all available sources of pharmacology data. While many relevant data sources are available on the linked open data cloud, their content needs to be combined with that of commercial datasets and the licensing of these commercial datasets respected when providing access to the data. Additionally, pharmaceutical companies have built up their own extensive private data collections that they require to be included in their pharmacological dataspace. In this paper we discuss the challenges of incorporating private and commercial data into a linked dataspace: focusing on the modelling of these datasets and their interlinking. We also present the graph-based access control mechanism that ensures commercial and private datasets are only available to authorized users.
http://link.springer.com/chapter/10.1007/978-3-642-41338-4_5
Including Co-Referent URIs in a SPARQL Query – Alasdair Gray
Linked data relies on instance level links between potentially differing representations of concepts in multiple datasets. However, in large complex domains, such as pharmacology, the inter-relationship of data instances needs to consider the context (e.g. task, role) of the user and the assumptions they want to apply to the data. Such context is not taken into account in most linked data integration procedures. In this paper we argue that dataset links should be stored in a stand-off fashion, thus enabling different assumptions to be applied to the data links during query execution. We present the infrastructure developed for the Open PHACTS Discovery Platform to enable this and show through evaluation that the incurred performance cost is below the threshold of user perception.
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
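The stand-off idea can be mimicked in plain SPARQL by expanding over the link set at query time; the Python sketch below walks owl:sameAs links in both directions with a property path. The data and IRIs are hypothetical, and the platform's lens mechanism is considerably richer.

    # Stand-off link expansion at query time: follow owl:sameAs chains
    # in either direction rather than assuming pre-merged data.
    from rdflib import Graph

    g = Graph().parse(data="""
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix ex:  <http://example.org/> .
    ex:dsA_aspirin owl:sameAs ex:dsB_aspirin .
    ex:dsB_aspirin ex:hasTarget ex:COX1 .
    """, format="turtle")

    rows = g.query("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX ex:  <http://example.org/>
    SELECT ?target WHERE {
      ex:dsA_aspirin (owl:sameAs|^owl:sameAs)* ?coref .
      ?coref ex:hasTarget ?target .
    }
    """)
    for row in rows:
        print(row.target)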
Alice: "What version of ChEMBL are we using?"
Bob: "Er…let me check. It's going to take a while, I'll get back to you."
This simple question took us the best part of a month to resolve and involved several individuals. Knowing the provenance of your data is essential, especially when using large complex systems that process multiple datasets.
The underlying issues of this simple question motivated us to improve the provenance data in the Open PHACTS project. We developed a guideline for dataset descriptions where the metadata is carried with the data. In this talk I will highlight the challenges we faced and give an overview of our metadata guidelines.
Presentation given to the W3C Semantic Web for Health Care and Life Sciences Interest Group on 14 January 2013.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... – Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTES – Subhajit Sahu
Graph algorithms, like PageRank, typically operate over Compressed Sparse Row (CSR), an adjacency-list based graph representation. The notes below compare map and reduce primitives under different execution modes:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... – John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... – pchutichetpong
M Capital Group (“MCG”) expects to see demand grow and supply evolve, facilitated through institutional investment rotation out of offices amid work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Tutorial: Describing Datasets with the Health Care and Life Sciences Community Profile
1. Describing Datasets with the Health Care and Life Sciences Community Profile
4 December 2016
Alasdair Gray
Department of Computer Science
Heriot-Watt University
www.macs.hw.ac.uk/~ajg33/
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Michel Dumontier
Stanford University
M. Scott Marshall
Netherlands Cancer Institute
2. Materials released under CC-BY License
You are free to:
• Share — copy and redistribute the material in any medium or format
• Adapt — remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
• Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
3. Outline
• Dataset Descriptions: Overview, Use Cases, Requirements, Implementations, Existing Vocabularies
• HCLS Community Profile: Overview, Example, Modules, Hands On (develop your own description)
• Validation: Overview, Demonstration, Hands On (validate your description)
• RDF Dataset Statistics: Overview, SPARQL Queries
4. W3C HCLS Group
Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331. https://doi.org/10.7717/peerj.2331
5. Use Cases
Overview of a couple of examples
6. FAIR Data Principles
Findable
• Global persistent identifier
• Rich metadata
• Store metadata in registries
Accessible
• Resolvable identifiers
• Metadata persists
• Machine and human access
Interoperable
• Open data format
• Modelled with FAIR compliant vocabularies
• Reference external data
Reusable
• Rich metadata
• Clear license
• Provenance
10. Which version of ChEMBL?
[Architecture diagram: domain-specific services sit on a Linked Data API (RDF/XML, TTL, JSON) over a Core Platform comprising a Data Cache (triple store), Semantic Workflow Engine, Identity Resolution Service, and Identifier Management Service. Example lookups: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Loaded datasets: ChEMBL-RDF v13, ChEMBL v13, Chem2Bio2RDF v12, SD v2 or v8.]
11. Which version of ChEMBL?
[Same architecture diagram, identified as the Open PHACTS Discovery Platform; historic use case, ~January 2012. Open PHACTS v2.1 loads ChEMBL 20; dataset descriptions at http://tiny.cc/ops-datasets]
12. Challenges
• Datasets available
  • In many versions over time
  • In different formats
  • From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
  • Can be out-of-date
  • Can contain conflicting information
Scientists require data provenance!
14. Metadata Requirements
• What is the dataset about?
• Name and description
• Content themes
• Who produced the dataset?
• Publisher details
• Content author details
• When was the dataset created?
• Date of creation
• Date of publication
• Versioning
• Where can I get the dataset?
• Licence and rights
• Download URL
• SPARQL endpoint
• Web service
• How was the dataset created?
• Experimental methodology
• Post-processing
• Why was the dataset created?
• Motivation
15. HCLS Additional Requirements
Standard metadata requirements plus:
1. Resolvable identifiers for metadata
2. Descriptions of data identifiers
3. Data provenance
4. Data statistics
18. Dublin Core Metadata Initiative
Widely used
Broadly applicable
• Documents
• Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties
“Date: A point or period of time associated with an event in the lifecycle of the resource.”
19. VoID: Vocabulary of Interlinked Datasets
Metadata carried with data
• Directly embedded: void:inDataset
✗ No versioning
✗ No checklist of requisite fields
✗ Only for RDF data
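As a sketch of that embedding mechanism (the resource and dataset IRIs are hypothetical), a resource published as RDF can point back at the dataset it belongs to:

  @prefix void: <http://rdfs.org/ns/void#> .

  # The resource declares which dataset its description belongs to
  <http://example.org/compound/123>
      void:inDataset <http://example.org/dataset/chembl> .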
20. DCAT: Data Catalog
Separates Dataset and Distribution
✗ No versioning
✗ No prescribed properties
Fixed with DCAT-AP, but DCAT-AP doesn’t meet the use case needs
31. Core Metadata (Title & description)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Type declaration | rdf:type | dctypes:Dataset | MUST | MUST | SHOULD
Type declaration | rdf:type | void:Dataset or dcat:Distribution | MUST NOT | MUST NOT | MUST
Title | dct:title | rdf:langString | MUST | MUST | MUST
Alternative titles | dct:alternative | rdf:langString | MAY | MAY | MAY
Description | dct:description | rdf:langString | MUST | MUST | MUST
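To make the table concrete, here is a minimal Turtle sketch of the type, title, and description elements; all IRIs under http://example.org/ are hypothetical placeholders, not official ChEMBL identifiers:

  @prefix dct:     <http://purl.org/dc/terms/> .
  @prefix dctypes: <http://purl.org/dc/dcmitype/> .
  @prefix dcat:    <http://www.w3.org/ns/dcat#> .
  @prefix :        <http://example.org/dataset/> .

  # Summary level: typed only as dctypes:Dataset
  :chembl a dctypes:Dataset ;
      dct:title "ChEMBL"@en ;
      dct:description "A database of bioactive drug-like small molecules."@en .

  # Distribution level: additionally typed as a dcat:Distribution
  :chembl20-ttl a dctypes:Dataset, dcat:Distribution ;
      dct:title "ChEMBL 20 RDF distribution"@en ;
      dct:description "Turtle serialisation of ChEMBL version 20."@en .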
32. Core Metadata (Dates & contributors)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Date created | dct:created | rdfs:Literal (ISO 8601 date/time string, typed with the appropriate XML Schema datatype) | MUST NOT | SHOULD | SHOULD
Other dates | pav:createdOn or pav:authoredOn or pav:curatedOn | xsd:dateTime, xsd:date, xsd:gYearMonth, or xsd:gYear | MUST NOT | MAY | MAY
Creators | dct:creator | IRI | MUST NOT | MUST | MUST
Contributors | dct:contributor or pav:createdBy or pav:authoredBy or pav:curatedBy | IRI | MUST NOT | MAY | MAY
Date of issue | dct:issued | rdfs:Literal (ISO 8601 date/time string, typed with the appropriate XML Schema datatype) | MUST NOT | SHOULD | SHOULD
33. Core Metadata (Publisher and licence)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Publisher | dct:publisher | IRI | MUST | MUST | MUST
HTML page | foaf:page | IRI | SHOULD | SHOULD | SHOULD
Logo | schemaorg:logo | IRI | SHOULD | SHOULD | SHOULD
License | dct:license | IRI | MAY | SHOULD | MUST
Rights | dct:rights | rdf:langString | MAY | MAY | MAY
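A version-level Turtle sketch covering the two tables above; the dates, agent IRIs, and licence IRI are illustrative assumptions:

  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix :     <http://example.org/dataset/> .

  # Version level: dates, agents, publisher, and licence
  :chembl20 dct:created "2015-02-03"^^xsd:date ;
      dct:issued "2015-02-09"^^xsd:date ;
      dct:creator <http://example.org/agent/chembl-team> ;
      dct:publisher <http://example.org/agent/ebi> ;
      foaf:page <http://example.org/chembl/about.html> ;
      dct:license <http://example.org/licences/cc-by-sa-3.0> .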
34. Core Metadata (Content description)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Keywords | dcat:keyword | xsd:string | MAY | MAY | MAY
Language | dct:language | http://lexvo.org/id/iso639-3/{tag} | MUST NOT | SHOULD | SHOULD
References | dct:references | IRI | MAY | MAY | MAY
Concept descriptors | dcat:theme | IRI of type skos:Concept | MAY | MAY | MAY
Vocabulary used | void:vocabulary | IRI | MUST NOT | MUST NOT | SHOULD
Standards used | dct:conformsTo | IRI | MUST NOT | MAY | SHOULD
Citations | cito:citesAsAuthority | IRI | MAY | MAY | MAY
Related material | rdfs:seeAlso | IRI | MAY | MAY | MAY
Partitions | dct:hasPart | IRI | MAY | MAY | MUST NOT
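Continuing the sketch at the version level (the keyword strings, concept IRI, and article IRI are again placeholders):

  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix :     <http://example.org/dataset/> .

  # Version level: keywords, language, theme, and references
  :chembl20 dcat:keyword "chemistry", "bioactivity" ;
      dct:language <http://lexvo.org/id/iso639-3/eng> ;
      dcat:theme <http://example.org/concept/drug-discovery> ;  # IRI of a skos:Concept
      dct:references <http://example.org/article/chembl-2015> .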
35. Identifiers
Element | Property | Value | Summary Level | Version Level | Distribution Level
Preferred prefix | idot:preferredPrefix | xsd:string | MAY | MAY | MAY
Alternate prefix | idot:alternatePrefix | xsd:string | MAY | MAY | MAY
Identifier pattern | idot:identifierPattern | xsd:string | MUST NOT | MUST NOT | MAY
URI pattern | void:uriRegexPattern | xsd:string | MUST NOT | MUST NOT | MAY
File access pattern | idot:accessPattern | idot:AccessPattern | MUST NOT | MUST NOT | MAY
Example identifier | idot:exampleIdentifier | xsd:string | MUST NOT | MUST NOT | SHOULD
Example resource | void:exampleResource | IRI | MUST NOT | MUST NOT | SHOULD
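A distribution-level sketch of the identifier elements, assuming the identifiers.org idot namespace and hypothetical example.org URI patterns:

  @prefix idot: <http://identifiers.org/idot/> .
  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix :     <http://example.org/dataset/> .

  # Distribution level: the identifier scheme used within the data
  :chembl20-ttl idot:preferredPrefix "chembl" ;
      idot:identifierPattern "CHEMBL\\d+" ;
      void:uriRegexPattern "^http://example\\.org/chembl/CHEMBL\\d+$" ;
      idot:exampleIdentifier "CHEMBL25" ;
      void:exampleResource <http://example.org/chembl/CHEMBL25> .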
36. Provenance and Change (Versioning)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Version identifier | pav:version | xsd:string | MUST NOT | MUST | SHOULD
Version linking | dct:isVersionOf | IRI | MUST NOT | MUST | MUST NOT
Version linking | pav:previousVersion | IRI | MUST NOT | SHOULD | SHOULD
Version linking | pav:hasCurrentVersion | IRI | MAY | MUST NOT | MUST NOT
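A versioning sketch linking a summary, its current version, and the preceding version (all IRIs hypothetical):

  @prefix pav: <http://purl.org/pav/> .
  @prefix dct: <http://purl.org/dc/terms/> .
  @prefix :    <http://example.org/dataset/> .

  # Version level: version number and links between versions
  :chembl20 pav:version "20" ;
      dct:isVersionOf :chembl ;
      pav:previousVersion :chembl19 .

  # Summary level: pointer to the latest version
  :chembl pav:hasCurrentVersion :chembl20 .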
37. Provenance and Change
Element | Property | Value | Summary Level | Version Level | Distribution Level
Data source provenance | dct:source or pav:retrievedFrom or prov:wasDerivedFrom | IRI | MUST NOT | SHOULD | SHOULD
Item listing | sio:has-data-item | IRI | MUST NOT | MUST NOT | MAY
Creation tool | pav:createdWith | IRI | MUST NOT | SHOULD | SHOULD
Update frequency | dct:accrualPeriodicity | IRI of type dctypes:Frequency | SHOULD | MUST NOT | MUST NOT
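A provenance sketch; the source and tool IRIs are placeholders, and the frequency term is assumed to come from the Dublin Core frequency vocabulary:

  @prefix pav: <http://purl.org/pav/> .
  @prefix dct: <http://purl.org/dc/terms/> .
  @prefix :    <http://example.org/dataset/> .

  # Version level: where the data came from and what created it
  :chembl20 pav:retrievedFrom <http://example.org/source/chembl_20_release> ;
      pav:createdWith <http://example.org/tool/rdf-export-pipeline> .

  # Summary level: how often new versions are expected
  :chembl dct:accrualPeriodicity <http://purl.org/cld/freq/quarterly> .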
38. Availability and Distributions
Element | Property | Value | Summary Level | Version Level | Distribution Level
Distribution description | dcat:distribution | IRI of Distribution Level description | MUST NOT | SHOULD | MUST NOT
File format | dct:format | IRI or xsd:string | MUST NOT | MUST NOT | MUST
File directory | dcat:accessURL | IRI | MAY | MAY | MAY
File URL | dcat:downloadURL | IRI | MUST NOT | MUST NOT | SHOULD
Byte size | dcat:byteSize | xsd:decimal | MUST NOT | MUST NOT | SHOULD
RDF File URL | void:dataDump | IRI | MUST NOT | MUST NOT | SHOULD
SPARQL endpoint | void:sparqlEndpoint | IRI | SHOULD | SHOULD NOT | SHOULD NOT
Documentation | dcat:landingPage | IRI | MUST NOT | MAY | MAY
Linkset | void:subset | IRI | MUST NOT | MUST NOT | SHOULD
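A distribution sketch; the download URL and byte size are placeholders:

  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix :     <http://example.org/dataset/> .

  # Version level: link from the version to its distribution
  :chembl20 dcat:distribution :chembl20-ttl .

  # Distribution level: format and access details
  :chembl20-ttl dct:format "text/turtle" ;
      dcat:downloadURL <http://example.org/download/chembl_20.ttl.gz> ;
      void:dataDump <http://example.org/download/chembl_20.ttl.gz> ;
      dcat:byteSize "2048000000"^^xsd:decimal .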
39. Statistics (RDF Core)
Element | Property | Value | Summary Level | Version Level | Distribution Level
# of triples | void:triples | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of typed entities | void:entities | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of subjects | void:distinctSubjects | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of properties | void:properties | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of objects | void:distinctObjects | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of classes | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
# of literals | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
# of RDF graphs | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
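A statistics sketch; every count below is a placeholder, and the class-count encoding via a class partition is one plausible reading of the profile's partition pattern:

  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix :     <http://example.org/dataset/> .

  # Distribution level: core statistics (placeholder counts)
  :chembl20-ttl void:triples "100000000"^^xsd:integer ;
      void:entities "10000000"^^xsd:integer ;
      void:distinctSubjects "12000000"^^xsd:integer ;
      void:properties "150"^^xsd:integer ;
      void:distinctObjects "40000000"^^xsd:integer ;
      # number of distinct classes, expressed as a class partition
      void:classPartition [ void:class rdfs:Class ;
                            void:entities "100"^^xsd:integer ] .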
40. Statistics (RDF Complete)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Class frequency | void:classPartition | IRI | MUST NOT | MUST NOT | MAY
Property frequency | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and subject types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and literals | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property, subject and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
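A frequency sketch for the complete statistics (the class IRI and counts are placeholders):

  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix :     <http://example.org/dataset/> .

  # Distribution level: per-class and per-property frequencies
  :chembl20-ttl
      void:classPartition [ void:class <http://example.org/vocab/Compound> ;
                            void:entities "150000"^^xsd:integer ] ;
      void:propertyPartition [ void:property dct:title ;
                               void:triples "150000"^^xsd:integer ] .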
41. Hands on
Create your own dataset description
If you don’t have your own, pick one of the following two
44. Validation Service
Andrew Beveridge, Jacob Baungard Hansen, Johnny Val, Leif Gehrmann, Roisin Farmer, Sunil Khutan, Tomas Robertson
45. Example Constraint
• Shape: a Dataset Summary
  • MUST be declared to be of type dctype:Dataset
  • MUST have a dcterms:title as a language typed string
  • MUST NOT have a dcterms:created date (dates are associated with versions in HCLS)
[Diagram: a <Dataset> node with an rdf:langString title; the dcterms:created arc is marked ✗]
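A sketch of how that constraint might be written in the compact ShEx syntax; the shape name is hypothetical:

  PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX dct:     <http://purl.org/dc/terms/>
  PREFIX dctypes: <http://purl.org/dc/dcmitype/>

  <DatasetSummaryShape> {
    rdf:type [dctypes:Dataset] ;  # MUST be typed dctypes:Dataset
    dct:title rdf:langString + ;  # MUST have a language-tagged title
    dct:created . {0}             # MUST NOT carry a creation date
  }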
55. http://www.w3.org/2015/03/ShExValidata/
• Try the ChEMBL example
• Play with Options
  • Add resources
  • Switch between open/closed shapes
  • Change requirement level
• Check your own work
• Try descriptions from
  • Riken Metadatabase: http://metadb.riken.jp/archives/HCLSProfile/HCLSProfile_MetaDB_ALL.ttl
  • PHI-Base: https://www.dropbox.com/s/tghogwagn962tmi/Metadata.ttl?dl=0
57. Statistics (RDF Core)
Element | Property | Value | Summary Level | Version Level | Distribution Level
# of triples | void:triples | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of typed entities | void:entities | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of subjects | void:distinctSubjects | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of properties | void:properties | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of objects | void:distinctObjects | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of classes | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
# of literals | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
# of RDF graphs | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
58. Statistics (RDF Complete)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Class frequency | void:classPartition | IRI | MUST NOT | MUST NOT | MAY
Property frequency | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and subject types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and literals | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property, subject and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
59. Why provide rich dataset descriptions?
• Support
  • Endpoint exploration
  • Query writing
• Eliminates expensive exploratory queries
62. Generate with SPARQL queries
• Number of triples
• Subject-types related to Object-types
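Two SPARQL sketches of the kinds of queries the slide alludes to; the variable names are my own:

  # Count the triples in the dataset
  SELECT (COUNT(*) AS ?triples)
  WHERE { ?s ?p ?o }

  # Relate subject types to object types through each property
  SELECT ?sType ?p ?oType (COUNT(*) AS ?links)
  WHERE {
    ?s ?p ?o .
    ?s a ?sType .
    ?o a ?oType .
  }
  GROUP BY ?sType ?p ?oType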
64. FAIR Data Principles
Findable
• Global persistent identifier
• Rich metadata
• Store metadata in registries
Accessible
• Resolvable identifiers
• Metadata persists
• Machine and human access
Interoperable
• Open data format
• Modelled with FAIR compliant vocabularies
• Reference external data
Reusable
• Rich metadata
• Clear license
• Provenance
Summary level: time unchanging information, e.g. name, description, publisher
Version level: version specific information, e.g. version number, creator, etc
Distribution level: file specific information, e.g. file location and format, number of triples
Reuse vocabularies: DCTerms, DCAT, VoID, FOAF, …
Prescribed properties: MUST, SHOULD, MAY, MUST NOT for each level
21 properties: 4 MUST, 4 SHOULD, 13 MAY
Link into data publishing pipeline via API
Not tied to HCLS, only a motivation
No existing tool meets these needs
Constraints form a graph pattern that data must comply with.
How do we validate that our example data conforms to a certain shape? Express the expected shape as ShEx. Toy example; what about for real?
ShEx:
Concise notation, regex based
W3C SHACL was not stable when this work was done; ShEx provides similar shape-based validation with extra features
Step through the validation process
Extended ShEx to allow arbitrary hierarchies
Toy example; what about for real?
The ShEx validator has other dependencies too:
minimist: command-line argument parsing
promise: callbacks
pegjs: parser generator
mocha: test-driven development
Acknowledge W3C HCLS IG