Validata (http://hw-swel.github.io/Validata/) is an online web application for validating a dataset description expressed in RDF against a community profile expressed as a Shape Expression (ShEx). Additionally, it provides an API for programmatic access to the validator. Validata can be used with multiple community-agreed standards, e.g. DCAT, the HCLS community profile, or the Open PHACTS guidelines, and there are currently deployments supporting each of these. Validata can easily be repurposed for a different deployment by providing it with a new ShEx schema. The Validata code is available from GitHub (https://github.com/HW-SWeL/Validata).
Presentation given at SDSVoc https://www.w3.org/2016/11/sdsvoc
The HCLS Community Profile: Describing Datasets, Versions, and Distributions – Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets.
The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.
Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)
Supporting Dataset Descriptions in the Life Sciences – Alasdair Gray
Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and requirements as to which must be provided and which are more optional. Developing a dataset description for a given dataset to conform to a specific metadata profile is a challenging process.
In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile and the tooling that I've developed to support dataset publishers in creating metadata description and validating them against a chosen specification.
Seminar talk given at the EBI on 5 April 2017
Tutorial: Describing Datasets with the Health Care and Life Sciences Communit... – Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.
An Identifier Scheme for the Digitising Scotland Project – Alasdair Gray
The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification scheme. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol to generate a unique identifier for any individual on a certificate, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of the content being misread and resulting in an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme, and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Centre - Scotland).
The DataTags System: Sharing Sensitive Data with Confidence – Merce Crosas
This talk was part of a session at the Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016:
https://rd-alliance.org/rda-8th-plenary-joint-meeting-ig-domain-repositories-wg-rdaniso-privacy-implications-research-data
Slides in Merce Crosas site:
http://scholar.harvard.edu/mercecrosas/presentations/datatags-system-sharing-sensitive-data-confidence
Metadata as Linked Data for Research Data Repositories – andrea huang
“Every man has his own cosmology and who can say that his own is right,” as Einstein said. The same is true when we come to data semantics: the same data may be interpreted differently by different data creators, curators, and re-users. How, then, do we build a better research data repository?
We start from the point made by Willis, Greenberg, and White (2012) that metadata for research data increases access to and reuse of that data. Stanford, Harvard, and Cornell all consider the use of linked data technologies a promising way to gather contextual information about research resources.
Looking for tools that can meet research repositories' urgent need for innovative, feature-rich data publishing services such as visualization, validation, and reuse across applications (Assante et al., 2016), we chose CKAN (the Comprehensive Knowledge Archive Network) as our starting point: a major solution that makes linked metadata available, citable, and validated.
Original file: http://m.odw.tw/u/odw/m/metadata-as-linked-data-for-research-data-repositories/
Open Source Tools Facilitating Sharing/Protecting Privacy: Dataverse and Data... – Merce Crosas
Presentation for the NFAIS Webinar series: Open Data Fostering Open Science: Meeting Researchers' Needs
http://www.nfais.org/index.php?option=com_mc&view=mc&mcid=72&eventId=508850&orgId=nfais
In our series on The Yosemite Project, we explore RDF as a data standard for health data. In this presentation, we will discuss with Claude Nanjo, a Software Architect at Cognitive Medical Systems, ways to expose clinical knowledge as OWL and RDF resources on the Web in order to promote greater convergence in the representation of health knowledge in the longer term. We will also explore how one might rally and coordinate the community to seed the Web with a core set of high-value resources and technologies that could greatly enhance health interoperability.
Our speaker, Joshua Mandel, will provide a lightning tour of Fast Healthcare Interoperability Resources (FHIR), an emerging clinical data standard, with a focus on its resource-oriented approach, and a discussion of how FHIR intersects with the Semantic Web. We'll look at how FHIR represents links between entities; how FHIR represents concepts from standards-based vocabularies; and how a set of FHIR instance data can be represented in RDF.
DataTags, The Tags Toolset, and Dataverse Integration – Michael Bar-Sinai
This presentation describes the concept of DataTags, which simplifies handling of sensitive datasets. It then shows the Tags toolset, and how it is integrated with Dataverse, Harvard's popular dataset repository.
Yosemite Project - Part 3 - Transformations for Integrating VA data with FHIR... – DATAVERSITY
In our series on The Yosemite Project, we explore RDF as a data standard for health data. In this installment, we will hear from Rafael Richards, Physician Informatician, Office of Informatics and Analytics in the Veterans Health Administration (VHA), about “Transformations for Integrating VA data with FHIR in RDF.”
The VistA EHR has its own data model and vocabularies for representing healthcare data. This webinar describes how SPARQL Inferencing Notation (SPIN) can be used to translate VistA data to the data representation used by FHIR, an emerging interchange standard.
This webinar explains the service, covers what publishers need to participate and answers any questions you may have. This webinar was held on November 10, 2015.
Wikidata: Verifiable, Linked Open Knowledge That Anyone Can Edit – Dario Taraborelli
Slides for my September 23 talk on Wikidata and WikiCite – NIH Frontiers in Data Science lecture series.
Persistent URL: https://dx.doi.org/10.6084/m9.figshare.3850821
This paper describes a proposed open semantic representation of chemical structure using JSON-LD (JSON for Linked Data) and an example of the semantic inferencing of a chemical structure concept (chirality).
Being FAIR: Enabling Reproducible Data Science – Carole Goble
Talk presented at Early Detection of Cancer Conference, OHSU, Portland, Oregon USA, 2-4 Oct 2018, http://earlydetectionresearch.com/ in the Data Science session
NADA is an open source web application for archiving, searching and browsing microdata using the data documentation initiative (DDI).
Key features are:
- Supports DDI and RDF
- Searches studies and variables
- Compares variables
- Provides data access for datasets via Public, Licensed, Direct, and Data Enclave access modes
We presented these slides at the NIH Data Commons kickoff meeting, showing some of the technologies that we propose to integrate in our "full stack" pilot.
Data Publishing at Harvard's Research Data Access Symposium – Merce Crosas
Data Publishing: The research community needs reliable, standard ways to make the data produced by scientific research available to the community, while giving credit to data authors. As a result, a new form of scholarly publication is emerging: data publishing. Data publishing - or making data reusable, citable, and accessible for long periods - is more than simply providing a link to a data file or posting the data to the researcher’s web site. We will discuss best practices, including the use of persistent identifiers and full data citations, the importance of metadata, the choice between public data and restricted data with terms of use, the workflows for collaboration and review before data release, and the role of trusted archival repositories. The Harvard Dataverse repository (and the Dataverse open-source software) provides a solution for data publishing, making it easy for researchers to follow these best practices, while satisfying data management requirements and incentivizing the sharing of research data.
CEDAR: Web-Based Tools for Accelerating the Creation of Standardized Metadata – Ahmad C. Bukhari
Mark A. Musen, John Graybeal, Martin J. O'Connor, Marcos Martínez-Romero, Attila L. Egyedi, Debra Willrett, S. Ahmad Chan Bukhari, Steven H. Kleinstein, Kei-Hoi Cheung, Elizabeth Thomson, Alejandra Gonzalez-Beltran, Phillipe Rocca-Serra, and Susanna-Assunta Sansone
Building Protected Data Sharing Networks to Advance Cancer Risk Assessment an... – Mary Bass
At the National Cancer Institute's 2018 Annual Meeting for ITCR (Information Technology for Cancer Research), Ian Foster, Globus co-founder and Argonne Data Science and Learning Division Director, presented on "Building Protected Data Sharing Networks to Advance Cancer Risk Assessment and Treatment."
Crediting informatics and data folks in life science teams – Carole Goble
Science Europe LEGS Committee: Career Pathways in Multidisciplinary Research: How to Assess the Contributions of Single Authors in Large Teams, 1-2 Dec 2015, Brussels
The People Behind Research Software: crediting from the informatics and technical point of view
The best way to publish and share research data is with a research data repository. A repository is an online database that allows research data to be preserved across time and helps others find it.
Data Repositories: Recommendation, Certification and Models for Cost Recovery – Anita de Waard
Talk at NITRD Workshop "Measuring the Impact of Digital Repositories" February 28 – March 1, 2017 https://www.nitrd.gov/nitrdgroups/index.php?title=DigitalRepositories
Similar to Validata: A tool for testing profile conformance (20)
Using a Jupyter Notebook to perform a reproducible scientific analysis over s... – Alasdair Gray
In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practice to overcome these issues.
Bioschemas Community: Developing profiles over Schema.org to make life scienc... – Alasdair Gray
The Bioschemas community (http://bioschemas.org) is a loose collaboration formed by a wide range of life science resource providers and informaticians. The community is developing profiles over Schema.org to enable life science resources such as data about a specific protein, sample, or training event, to be more discoverable on the web. While the content of well-known resources such as Uniprot (for protein data) are easily discoverable, there is a long tail of specialist resources that would benefit from embedding Schema.org markup in a standardised approach.
The community has developed twelve profiles for specific types of life science resources (http://bioschemas.org/specifications/), with another six at an early draft stage. For each profile, a set of use cases has been identified. These typically focus on search, but several facilitate lightweight data exchange to support data aggregators such as Identifiers.org, FAIRsharing.org, and BioSamples. The next stage of the development of a profile consists of mapping the terms used in the use cases to existing properties in Schema.org and domain ontologies. The properties are then prioritised in order to support the use cases, with a minimal set of about six properties identified, along with a larger set of recommended and optional properties. For each property, an expected cardinality is defined and, where appropriate, object values are specified from controlled vocabularies. Before a profile is finalised, it must first be demonstrated that resources can deploy the markup.
In this talk, we will outline the progress that has been made by the Bioschemas Community in a single year through three hackathon events. We will discuss the processes followed by the Bioschemas Community to foster collaboration, and highlight the benefits and drawbacks of using open Google documents and spreadsheets to support the community in developing the profiles. We will conclude by summarising future opportunities and directions for the community.
Presentation given at the Open PHACTS project symposium.
The slides give an overview of the data in the 2.0 Open PHACTS drug discovery platform and the challenges that have been faced in the Open PHACTS project to reach this stage.
This presentation was prepared for my faculty Christmas conference.
Abstract: For the last 11 months I have been working on a top secret project with a world renowned Scandinavian industry partner. We are now moving into the exciting operational phase of this project. I have been granted an early lifting of the embargo that has stopped me talking about this work up until now. I will talk about the data science behind this big data project and how semantic web technology has enabled the delivery of Project X.
Data Integration in a Big Data Context: An Open PHACTS Case Study – Alasdair Gray
Keynote presentation at the EU Ambient Assisted Living Forum workshop The Crusade for Big Data in the AAL Domain.
The presentation explores the Open PHACTS project and how it overcame various Big Data challenges.
Data is being generated all around us – from our smart phones tracking our movement through a city to the city itself sensing various properties and reacting to various conditions. However, to maximise the potential from all this data, it needs to be combined and coerced into models that enable analysis and interpretation. In this talk I will give an overview of the techniques that I have developed for data integration: integrating streams of sensor data with background contextual data and supporting multiple interpretations of linking data together. At the end of the talk I will overview the work I will be conducting in the Administrative Data Research Centre for Scotland.
Many areas of scientific discovery rely on combining data from multiple data sources. However, there are many challenges in linking data. This presentation highlights these challenges in the context of using Linked Data for environmental and social science databases.
Scientific lenses to support multiple views over linked chemistry data – Alasdair Gray
When are two entries about a small molecule in different datasets the same? If they have the same drug name, chemical structure, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this paper, we present an approach to enable applications to choose the equivalence criteria to apply between datasets. Thus, supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Scientific Lenses over Linked Data: An approach to support multiple integrate... – Alasdair Gray
When are two entries about a concept in different datasets the same? If they have the same name, properties, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this presentation, I will introduce Scientific lenses, an approach that enables applications to vary the equivalence conditions between linked datasets. They have been deployed in the Open PHACTS Discovery Platform – a large scale data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Describing Scientific Datasets: The HCLS Community Profile – Alasdair Gray
Big Data presents an exciting opportunity to pursue large-scale analyses over collections of data in order to uncover valuable insights across a myriad of fields and disciplines. Yet, as more and more data is made available, researchers are finding it increasingly difficult to discover and reuse these data. One problem is that data are insufficiently described to understand what they are or how they were produced. A second issue is that no single vocabulary provides all key metadata fields required to support basic scientific use cases. A third issue is that data catalogs and data repositories all use different metadata standards, if they use any standard at all, and this prevents easy search and aggregation of data. Therefore, we need a community profile to indicate what are the essential metadata, and the manner in which we can express it.
The W3C Health Care and Life Sciences Interest Group have developed such a community profile that defines the required properties to provide high-quality dataset descriptions that support finding, understanding, and reusing scientific data, i.e. making the data FAIR (Findable, Accessible, Interoperable and Re-usable – http://datafairport.org). The specification reuses many notions and vocabulary terms from Dublin Core, DCAT and VoID, with provenance and versioning information being provided by PROV-O and PAV. The community profile is based around a three-tier model: the summary description captures catalogue-style metadata about the dataset; each version of the dataset is described separately, as are the various distribution formats of those versions. The resulting community profile is generic and applicable to a wide variety of scientific data.
Tools are being developed to help with the creation and validation of these descriptions. Several datasets including those from Bio2RDF, EBI and IntegBio are already moving to release descriptions conforming to the community profile.
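To make the three-tier model concrete, the following Turtle sketch shows one way a summary, a version, and a distribution could be described in the spirit of the community profile; the example.org IRIs, the dataset name, and the particular selection of properties are illustrative assumptions, not an authoritative or complete HCLS description.

@prefix dct:    <http://purl.org/dc/terms/> .
@prefix dctype: <http://purl.org/dc/dcmitype/> .
@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix pav:    <http://purl.org/pav/> .
@prefix void:   <http://rdfs.org/ns/void#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# Summary level: catalogue-style metadata that stays the same across versions.
<http://example.org/dataset/exampledb>
    a dctype:Dataset ;
    dct:title "ExampleDB"@en ;
    dct:description "An illustrative dataset used to sketch the profile."@en ;
    dct:publisher <http://example.org/org/example-institute> .

# Version level: metadata specific to one release of the dataset.
<http://example.org/dataset/exampledb/2.1>
    a dctype:Dataset ;
    dct:title "ExampleDB version 2.1"@en ;
    dct:isVersionOf <http://example.org/dataset/exampledb> ;
    pav:version "2.1" ;
    dct:issued "2016-12-01"^^xsd:date ;
    dcat:distribution <http://example.org/dataset/exampledb/2.1/rdf> .

# Distribution level: a concrete file and format of that version.
<http://example.org/dataset/exampledb/2.1/rdf>
    a dcat:Distribution, void:Dataset ;
    dct:format "application/n-triples" ;
    dcat:downloadURL <http://example.org/downloads/exampledb-2.1.nt.gz> ;
    void:triples 1234567 .

A description structured this way can then be checked against a machine-readable encoding of the profile, which is the role Validata plays.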
SensorBench is a benchmark suite for wireless sensor networks. The design of wireless sensor network systems sits within a multi-dimensional design space, where it can be difficult to understand the implications of specific decisions and to identify optimal solutions. SensorBench enables the systematic analysis and comparison of different techniques and platforms, enabling both development and user communities to make well informed choices. The benchmark identifies key variables and performance metrics, and specifies experiments that explore how different types of task perform under different metrics for the controlled variables. The benchmark is demonstrated by its application on representative platforms.
Full details of the benchmark are available from http://dl.acm.org/citation.cfm?id=2618252 (DOI: 10.1145/2618243.2618252)
What are the research and technical challenges of linked data that are relevant to data science?
This presentation introduces the ideas of linked data using the BBC sport web site as an example. It then identifies several research challenges that remain to be addressed.
Dataset Descriptions in Open PHACTS and HCLS – Alasdair Gray
This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The creation of the specification was driven by a real need within the project to track the datasets used.
Details of the dataset metadata captured and the vocabularies used to model this metadata are given together with the tools developed to enable the specification's uptake.
Over the course of the last 12 months, the W3C Health Care and Life Sciences Interest Group have been developing a community profile for dataset descriptions. This has drawn on the ideas developed in the Open PHACTS specification. A brief overview of the forthcoming community profile is given in the presentation.
This presentation was given to the Network Data Exchange project http://www.ndexbio.org/ on 2 April 2014.
Computing Identity Co-Reference Across Drug Discovery Datasets – Alasdair Gray
This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets; with each dataset using their own identifier. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task and domain specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied for the context of their application. We highlight the challenges of automatically computing co-reference and the need for capturing the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
Incorporating Commercial and Private Data into an Open Linked Data Platform f... – Alasdair Gray
The Open PHACTS Discovery Platform aims to provide an integrated information space to advance pharmacological research in the area of drug discovery. Effective drug discovery requires comprehensive data coverage, i.e. integrating all available sources of pharmacology data. While many relevant data sources are available on the linked open data cloud, their content needs to be combined with that of commercial datasets and the licensing of these commercial datasets respected when providing access to the data. Additionally, pharmaceutical companies have built up their own extensive private data collections that they require to be included in their pharmacological dataspace. In this paper we discuss the challenges of incorporating private and commercial data into a linked dataspace: focusing on the modelling of these datasets and their interlinking. We also present the graph-based access control mechanism that ensures commercial and private datasets are only available to authorized users.
http://link.springer.com/chapter/10.1007/978-3-642-41338-4_5
Including Co-Referent URIs in a SPARQL Query – Alasdair Gray
Linked data relies on instance level links between potentially differing representations of concepts in multiple datasets. However, in large complex domains, such as pharmacology, the inter-relationship of data instances needs to consider the context (e.g. task, role) of the user and the assumptions they want to apply to the data. Such context is not taken into account in most linked data integration procedures. In this paper we argue that dataset links should be stored in a stand-off fashion, thus enabling different assumptions to be applied to the data links during query execution. We present the infrastructure developed for the Open PHACTS Discovery Platform to enable this and show through evaluation that the incurred performance cost is below the threshold of user perception.
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
Alice: "What version of ChEMBL are we using?"
Bob: "Er…let me check. It's going to take a while, I'll get back to you."
This simple question took us the best part of a month to resolve and involved several individuals. Knowing the provenance of your data is essential, especially when using large complex systems that process multiple datasets.
The underlying issues of this simple question motivated us to improve the provenance data in the Open PHACTS project. We developed a guideline for dataset descriptions where the metadata is carried with the data. In this talk I will highlight the challenges we faced and give an overview of our metadata guidelines.
Presentation given to the W3C Semantic Web for Health Care and Life Sciences Interest Group on 14 January 2013.
Essentials of Automations: Optimizing FME Workflows with Parameters – Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
After several years of existence, an extremely active community, and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality – Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic* – Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath Test Automation using UiPath Test Suite series, part 4 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast!?! @KCD Istanbul 2024 – Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need in order to apply AI to our own infrastructure and get it working from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 – Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Connector Corner: Automate dynamic content and events by pushing a button – DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Validata: A tool for testing profile conformance
1. Validata: A tool for testing profile conformance
Alasdair J G Gray
Heriot-Watt University
www.macs.hw.ac.uk/~ajg33
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Andrew Beveridge
Jacob Baungard Hansen
Johnny Val
Leif Gehrmann
Roisin Farmer
Sunil Khutan
Tomas Robertson
2. HCLS Dataset Descriptions
https://www.w3.org/TR/hcls-dataset/
Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331 https://doi.org/10.7717/peerj.2331
3. Requirements
• Online tool
  – Deployable on W3C server
  – GUI
  – API
• Support multiple constraints
  – Properties
  – Data values
  – …
• Requirement levels
  – Different levels of user messages: Error, Warning, Information
• Configurable
  – HCLS (Required)
  – DCAT, Open PHACTS, etc. (Optional)
4. Example Constraint
• Shape
• A Dataset
  – MUST be declared to be of type dctype:Dataset
  – MUST have a dcterms:title as a language-typed string
  – MUST NOT have a dcterms:created date
[Slide figure: the <Dataset> shape as a graph, with dcterms:title required to be an rdf:langString and the dcterms:created arc marked ✗]
Note: dates are associated with versions in HCLS.
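The constraint described on this slide can be written down as a Shape Expression. Below is a minimal sketch in ShExC (ShEx 2 compact syntax, which post-dates the ShEx dialect Validata originally consumed); the shape name and prefix bindings are assumptions made for illustration.

PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX dctype: <http://purl.org/dc/dcmitype/>

<DatasetShape> {
  rdf:type [dctype:Dataset] ;   # MUST be declared to be of type dctype:Dataset
  dct:title rdf:langString ;    # MUST have a language-typed (rdf:langString) title
  dct:created . {0}             # MUST NOT have a creation date (dates belong to versions)
}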
5. Example Validation
• Shape
• Data
[Slide figure: the <Dataset> shape from the previous slide matched step by step against example data; ✗ marks the triple that violates the shape]
6.–7. Example Validation (further build steps of the same shape/data walk-through)
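For a toy illustration of what the validation on these slides checks, here is some hand-made Turtle data: the first resource satisfies the shape above, while the second would be reported as a violation because its title lacks a language tag and it carries a dcterms:created date. The example.org IRIs are invented for illustration.

@prefix dct:    <http://purl.org/dc/terms/> .
@prefix dctype: <http://purl.org/dc/dcmitype/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# Conforms: correct type, language-tagged title, no creation date.
<http://example.org/good-dataset>
    a dctype:Dataset ;
    dct:title "An example dataset"@en .

# Fails: plain-literal title and a prohibited dct:created date.
<http://example.org/bad-dataset>
    a dctype:Dataset ;
    dct:title "Another dataset" ;
    dct:created "2016-12-01"^^xsd:date .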
13. Validata
https://github.com/HW-SWeL/Validata
• RDF constraint validation tool
  – Configurable to any profile
• Shape Expression (ShEx) constraints
• Open source JavaScript implementation
www.macs.hw.ac.uk/~ajg33/
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Editor's Notes
Motivation: how do we check descriptions conform?
Summary level: time unchanging information, e.g. name, description, publisher
Version level: version specific information, e.g. version number, creator, etc
Distribution level: file specific information, e.g. file location and format, number of triples
18 vocabularies: DCTerms, DCAT, VoID, FOAF, …
61 prescribed properties: MUST, SHOULD, MAY, MUST NOT for each level
Link into data publishing pipeline via API
Not tied to HCLS, only a motivation
No existing tool meets these needs
Constraints form a graph pattern that data must comply with
How do we validate that our example data conforms to a certain shape
Express expected shape as ShEx
Toy example, what about for real
ShEx: concise notation, regex based. W3C SHACL was not stable when this work was done; ShEx covers the same ground as SHACL, with some extra features.
Step through validation process
Extended ShEx to allow arbitrary hierarchies
Toy example, what about for real
ShEx-validator has other dependencies too:
Minimist: argument parsing
Promise: callbacks
PEG.js: parser generator
Mocha: test-driven development