Validata (http://hw-swel.github.io/Validata/) is an online web application for validating a dataset description expressed in RDF against a community profile expressed as a Shape Expression (ShEx). Additionally, it provides an API for programmatic access to the validator. Validata can be used with multiple community-agreed standards, e.g. DCAT, the HCLS community profile, or the Open PHACTS guidelines, and there are currently deployments supporting each of these. Validata can easily be repurposed for a different deployment by providing it with a new ShEx schema. The Validata code is available from GitHub (https://github.com/HW-SWeL/Validata).
Presentation given at SDSVoc https://www.w3.org/2016/11/sdsvoc
The HCLS Community Profile: Describing Datasets, Versions, and Distributions – Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets.
The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.
Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)
Supporting Dataset Descriptions in the Life Sciences – Alasdair Gray
Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and requirements as to which must be provided and which are more optional. Developing a dataset description for a given dataset to conform to a specific metadata profile is a challenging process.
In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile and the tooling that I've developed to support dataset publishers in creating metadata description and validating them against a chosen specification.
Seminar talk given at the EBI on 5 April 2017
Tutorial: Describing Datasets with the Health Care and Life Sciences Communit... – Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.
An Identifier Scheme for the Digitising Scotland Project – Alasdair Gray
The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification scheme. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol to generate a unique identifier for any individual on a certificate, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of the content being misread and resulting in an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme, and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Centre - Scotland).
The DataTags System: Sharing Sensitive Data with Confidence – Merce Crosas
This talk was part of a session at the Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016:
https://rd-alliance.org/rda-8th-plenary-joint-meeting-ig-domain-repositories-wg-rdaniso-privacy-implications-research-data
Slides in Merce Crosas site:
http://scholar.harvard.edu/mercecrosas/presentations/datatags-system-sharing-sensitive-data-confidence
Metadata as Linked Data for Research Data Repositories – andrea huang
“Every man has his own cosmology and who can say that his own is right,” as Einstein said. The same is true when we come to data semantics: the same data may be interpreted differently by different data creators, curators, and re-users. How, then, do we build a better research data repository?
We start from the point made by Willis, Greenberg, and White (2012) that metadata for research data increases access to and reuse of that data. Stanford, Harvard, and Cornell all consider the use of linked data technologies a promising way to gather contextual information about research resources.
Looking for tools that can meet research repositories' urgent need for innovative, feature-rich data publishing services such as visualization, validation, and reuse across applications (Assante et al., 2016), we chose CKAN (the Comprehensive Knowledge Archive Network) as our starting point: a major solution that makes linked metadata available, citable, and validated.
Original file: http://m.odw.tw/u/odw/m/metadata-as-linked-data-for-research-data-repositories/
Open Source Tools Facilitating Sharing/Protecting Privacy: Dataverse and Data... – Merce Crosas
Presentation for the NFAIS Webinar series: Open Data Fostering Open Science: Meeting Researchers' Needs
http://www.nfais.org/index.php?option=com_mc&view=mc&mcid=72&eventId=508850&orgId=nfais
In our series on The Yosemite Project, we explore RDF as a data standard for health data. In this presentation, we will discuss with Claude Nanjo, a Software Architect at Cognitive Medical Systems, ways to expose clinical knowledge as OWL and RDF resources on the Web in order to promote greater convergence in the representation of health knowledge in the longer term. We will also explore how one might rally and coordinate the community to seed the Web with a core set of high-value resources and technologies that could greatly enhance health interoperability.
Our speaker, Joshua Mandel, will provide a lightning tour of Fast Healthcare Interoperability Resources (FHIR), an emerging clinical data standard, with a focus on its resource-oriented approach, and a discussion of how FHIR intersects with the Semantic Web. We'll look at how FHIR represents links between entities; how FHIR represents concepts from standards-based vocabularies; and how a set of FHIR instance data can be represented in RDF.
DataTags, The Tags Toolset, and Dataverse Integration – Michael Bar-Sinai
This presentation describes the concept of DataTags, which simplifies handling of sensitive datasets. It then shows the Tags toolset, and how it is integrated with Dataverse, Harvard's popular dataset repository.
Yosemite Project - Part 3 - Transformations for Integrating VA data with FHIR... – DATAVERSITY
In our series on The Yosemite Project, we explore RDF as a data standard for health data. In this installment, we will hear from Rafael Richards, Physician Informatician, Office of Informatics and Analytics in the Veterans Health Administration (VHA), about “Transformations for Integrating VA data with FHIR in RDF.”
The VistA EHR has its own data model and vocabularies for representing healthcare data. This webinar describes how SPARQL Inferencing Notation (SPIN) can be used to translate VistA data to the data representation used by FHIR, an emerging interchange standard.
This webinar explains the service, covers what publishers need to participate and answers any questions you may have. This webinar was held on November 10, 2015.
Wikidata: Verifiable, Linked Open Knowledge That Anyone Can Edit – Dario Taraborelli
Slides for my September 23 talk on Wikidata and WikiCite – NIH Frontiers in Data Science lecture series.
Persistent URL: https://dx.doi.org/10.6084/m9.figshare.3850821
This paper describes a proposed open semantic representation of chemical structure using JSON-LD (JSON for Linked Data) and an example of the semantic inferencing of a chemical structure concept (chirality).
Being FAIR: Enabling Reproducible Data Science – Carole Goble
Talk presented at Early Detection of Cancer Conference, OHSU, Portland, Oregon USA, 2-4 Oct 2018, http://earlydetectionresearch.com/ in the Data Science session
NADA is an open source web application for archiving, searching and browsing microdata using the data documentation initiative (DDI).
Key features are:
- Supports DDI and RDF
- Searches studies and variables
- Compares variables
- Provides data access for datasets via Public, Licensed, Direct, and Data Enclave access modes
We presented these slides at the NIH Data Commons kickoff meeting, showing some of the technologies that we propose to integrate in our "full stack" pilot.
Data Publishing at Harvard's Research Data Access Symposium – Merce Crosas
Data Publishing: The research community needs reliable, standard ways to make the data produced by scientific research available to the community, while giving credit to data authors. As a result, a new form of scholarly publication is emerging: data publishing. Data publishing - or making data reusable, citable, and accessible for long periods - is more than simply providing a link to a data file or posting the data to the researcher’s web site. We will discuss best practices, including the use of persistent identifiers and full data citations, the importance of metadata, the choice between public data and restricted data with terms of use, the workflows for collaboration and review before data release, and the role of trusted archival repositories. The Harvard Dataverse repository (and the Dataverse open-source software) provides a solution for data publishing, making it easy for researchers to follow these best practices, while satisfying data management requirements and incentivizing the sharing of research data.
CEDAR: Web-Based Tools for Accelerating the Creation of Standardized Metadata – Ahmad C. Bukhari
Mark A. Musen, John Graybeal, Martin J. O'Connor, Marcos Martínez-Romero, Attila L. Egyedi, Debra Willrett, S. Ahmad Chan Bukhari, Steven H. Kleinstein, Kei-Hoi Cheung, Elizabeth Thomson, Alejandra Gonzalez-Beltran, Phillipe Rocca-Serra, and Susanna-Assunta Sansone
Building Protected Data Sharing Networks to Advance Cancer Risk Assessment an... – Mary Bass
At the National Cancer Institute's 2018 Annual Meeting for ITCR (Information Technology for Cancer Research), Ian Foster, Globus co-founder and Argonne Data Science and Learning Division Director, presented on "Building Protected Data Sharing Networks to Advance Cancer Risk Assessment and Treatment."
Crediting informatics and data folks in life science teams – Carole Goble
Science Europe LEGS Committee: Career Pathways in Multidisciplinary Research: How to Assess the Contributions of Single Authors in Large Teams, 1-2 Dec 2015, Brussels
The People Behind Research Software: crediting from the informatics and technical point of view
The best way to publish and share research data is with a research data repository. A repository is an online database that allows research data to be preserved across time and helps others find it.
Data Repositories: Recommendation, Certification and Models for Cost Recovery – Anita de Waard
Talk at NITRD Workshop "Measuring the Impact of Digital Repositories" February 28 – March 1, 2017 https://www.nitrd.gov/nitrdgroups/index.php?title=DigitalRepositories
Similar to Validata: A tool for testing profile conformance (20)
Using a Jupyter Notebook to perform a reproducible scientific analysis over s... – Alasdair Gray
In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practice to overcome these issues.
Bioschemas Community: Developing profiles over Schema.org to make life scienc... – Alasdair Gray
The Bioschemas community (http://bioschemas.org) is a loose collaboration formed by a wide range of life science resource providers and informaticians. The community is developing profiles over Schema.org to enable life science resources such as data about a specific protein, sample, or training event, to be more discoverable on the web. While the content of well-known resources such as Uniprot (for protein data) are easily discoverable, there is a long tail of specialist resources that would benefit from embedding Schema.org markup in a standardised approach.
The community has developed twelve profiles for specific types of life science resources (http://bioschemas.org/specifications/), with another six at an early draft stage. For each profile, a set of use cases has been identified. These typically focus on search, but several facilitate lightweight data exchange to support data aggregators such as Identifiers.org, FAIRsharing.org, and BioSamples. The next stage of the development of a profile consists of mapping the terms used in the use cases to existing properties in Schema.org and domain ontologies. The properties are then prioritised in order to support the use cases, with a minimal set of about six properties identified, along with a larger set of recommended and optional properties. For each property, an expected cardinality is defined and, where appropriate, object values are specified from controlled vocabularies. Before a profile is finalised, it must first be demonstrated that resources can deploy the markup.
In this talk, we will outline the progress that has been made by the Bioschemas Community in a single year through three hackathon events. We will discuss the processes followed by the Bioschemas Community to foster collaboration, and highlight the benefits and drawbacks of using open Google documents and spreadsheets to support the community in developing the profiles. We will conclude by summarising future opportunities and directions for the community.
Presentation given at the Open PHACTS project symposium.
The slides give an overview of the data in the 2.0 Open PHACTS drug discovery platform and the challenges that have been faced in the Open PHACTS project to reach this stage.
This presentation was prepared for my faculty Christmas conference.
Abstract: For the last 11 months I have been working on a top secret project with a world renowned Scandinavian industry partner. We are now moving into the exciting operational phase of this project. I have been granted an early lifting of the embargo that has stopped me talking about this work up until now. I will talk about the data science behind this big data project and how semantic web technology has enabled the delivery of Project X.
Data Integration in a Big Data Context: An Open PHACTS Case Study – Alasdair Gray
Keynote presentation at the EU Ambient Assisted Living Forum workshop The Crusade for Big Data in the AAL Domain.
The presentation explores the Open PHACTS project and how it overcame various Big Data challenges.
Data is being generated all around us – from our smart phones tracking our movement through a city to the city itself sensing various properties and reacting to various conditions. However, to maximise the potential from all this data, it needs to be combined and coerced into models that enable analysis and interpretation. In this talk I will give an overview of the techniques that I have developed for data integration: integrating streams of sensor data with background contextual data and supporting multiple interpretations of linking data together. At the end of the talk I will overview the work I will be conducting in the Administrative Data Research Centre for Scotland.
Many areas of scientific discovery rely on combining data from multiple data sources. However, there are many challenges in linking data. This presentation highlights these challenges in the context of using Linked Data for environmental and social science databases.
Scientific lenses to support multiple views over linked chemistry data – Alasdair Gray
When are two entries about a small molecule in different datasets the same? If they have the same drug name, chemical structure, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this paper, we present an approach to enable applications to choose the equivalence criteria to apply between datasets. Thus, supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Scientific Lenses over Linked Data: An approach to support multiple integrate... – Alasdair Gray
When are two entries about a concept in different datasets the same? If they have the same name, properties, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this presentation, I will introduce Scientific lenses, an approach that enables applications to vary the equivalence conditions between linked datasets. They have been deployed in the Open PHACTS Discovery Platform – a large scale data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Describing Scientific Datasets: The HCLS Community Profile – Alasdair Gray
Big Data presents an exciting opportunity to pursue large-scale analyses over collections of data in order to uncover valuable insights across a myriad of fields and disciplines. Yet, as more and more data is made available, researchers are finding it increasingly difficult to discover and reuse these data. One problem is that data are insufficiently described to understand what they are or how they were produced. A second issue is that no single vocabulary provides all key metadata fields required to support basic scientific use cases. A third issue is that data catalogs and data repositories all use different metadata standards, if they use any standard at all, and this prevents easy search and aggregation of data. Therefore, we need a community profile to indicate what are the essential metadata, and the manner in which we can express it.
The W3C Health Care and Life Sciences Interest Group have developed such a community profile that defines the required properties to provide high-quality dataset descriptions that support finding, understanding, and reusing scientific data, i.e. making the data FAIR (Findable, Accessible, Interoperable and Re-usable – http://datafairport.org). The specification reuses many notions and vocabulary terms from Dublin Core, DCAT and VoID, with provenance and versioning information being provided by PROV-O and PAV. The community profile is based around a three-tier model: the summary description captures catalogue-style metadata about the dataset; each version of the dataset is described separately, as are the various distribution formats of those versions. The resulting community profile is generic and applicable to a wide variety of scientific data.
Tools are being developed to help with the creation and validation of these descriptions. Several datasets including those from Bio2RDF, EBI and IntegBio are already moving to release descriptions conforming to the community profile.
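To make the three-tier model concrete, the following Turtle sketch shows one way a summary, a version, and a distribution could be described in the spirit of the community profile; the example.org IRIs, the dataset name, and the particular selection of properties are illustrative assumptions, not an authoritative or complete HCLS description.

@prefix dct:    <http://purl.org/dc/terms/> .
@prefix dctype: <http://purl.org/dc/dcmitype/> .
@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix pav:    <http://purl.org/pav/> .
@prefix void:   <http://rdfs.org/ns/void#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# Summary level: catalogue-style metadata that stays the same across versions.
<http://example.org/dataset/exampledb>
    a dctype:Dataset ;
    dct:title "ExampleDB"@en ;
    dct:description "An illustrative dataset used to sketch the profile."@en ;
    dct:publisher <http://example.org/org/example-institute> .

# Version level: metadata specific to one release of the dataset.
<http://example.org/dataset/exampledb/2.1>
    a dctype:Dataset ;
    dct:title "ExampleDB version 2.1"@en ;
    dct:isVersionOf <http://example.org/dataset/exampledb> ;
    pav:version "2.1" ;
    dct:issued "2016-12-01"^^xsd:date ;
    dcat:distribution <http://example.org/dataset/exampledb/2.1/rdf> .

# Distribution level: a concrete file and format of that version.
<http://example.org/dataset/exampledb/2.1/rdf>
    a dcat:Distribution, void:Dataset ;
    dct:format "application/n-triples" ;
    dcat:downloadURL <http://example.org/downloads/exampledb-2.1.nt.gz> ;
    void:triples 1234567 .

A description structured this way can then be checked against a machine-readable encoding of the profile, which is the role Validata plays.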
SensorBench is a benchmark suite for wireless sensor networks. The design of wireless sensor network systems sits within a multi-dimensional design space, where it can be difficult to understand the implications of specific decisions and to identify optimal solutions. SensorBench enables the systematic analysis and comparison of different techniques and platforms, enabling both development and user communities to make well informed choices. The benchmark identifies key variables and performance metrics, and specifies experiments that explore how different types of task perform under different metrics for the controlled variables. The benchmark is demonstrated by its application on representative platforms.
Full details of the benchmark are available from http://dl.acm.org/citation.cfm?id=2618252 (DOI: 10.1145/2618243.2618252)
What are the research and technical challenges of linked data that are relevant to data science?
This presentation introduces the ideas of linked data using the BBC sport web site as an example. It then identifies several research challenges that remain to be addressed.
Dataset Descriptions in Open PHACTS and HCLS – Alasdair Gray
This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The creation of the specification was driven by a real need within the project to track the datasets used.
Details of the dataset metadata captured and the vocabularies used to model this metadata are given together with the tools developed to enable the specification's uptake.
Over the course of the last 12 months, the W3C Health Care and Life Sciences Interest Group have been developing a community profile for dataset descriptions. This has drawn on the ideas developed in the Open PHACTS specification. A brief overview of the forthcoming community profile is given in the presentation.
This presentation was given to the Network Data Exchange project http://www.ndexbio.org/ on 2 April 2014.
Computing Identity Co-Reference Across Drug Discovery Datasets – Alasdair Gray
This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets; with each dataset using their own identifier. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task and domain specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied for the context of their application. We highlight the challenges of automatically computing co-reference and the need for capturing the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
Incorporating Commercial and Private Data into an Open Linked Data Platform f... – Alasdair Gray
The Open PHACTS Discovery Platform aims to provide an integrated information space to advance pharmacological research in the area of drug discovery. Effective drug discovery requires comprehensive data coverage, i.e. integrating all available sources of pharmacology data. While many relevant data sources are available on the linked open data cloud, their content needs to be combined with that of commercial datasets and the licensing of these commercial datasets respected when providing access to the data. Additionally, pharmaceutical companies have built up their own extensive private data collections that they require to be included in their pharmacological dataspace. In this paper we discuss the challenges of incorporating private and commercial data into a linked dataspace: focusing on the modelling of these datasets and their interlinking. We also present the graph-based access control mechanism that ensures commercial and private datasets are only available to authorized users.
http://link.springer.com/chapter/10.1007/978-3-642-41338-4_5
Including Co-Referent URIs in a SPARQL Query – Alasdair Gray
Linked data relies on instance level links between potentially differing representations of concepts in multiple datasets. However, in large complex domains, such as pharmacology, the inter-relationship of data instances needs to consider the context (e.g. task, role) of the user and the assumptions they want to apply to the data. Such context is not taken into account in most linked data integration procedures. In this paper we argue that dataset links should be stored in a stand-off fashion, thus enabling different assumptions to be applied to the data links during query execution. We present the infrastructure developed for the Open PHACTS Discovery Platform to enable this and show through evaluation that the incurred performance cost is below the threshold of user perception.
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
Alice: "What version of ChEMBL are we using?"
Bob: "Er…let me check. It's going to take a while, I'll get back to you."
This simple question took us the best part of a month to resolve and involved several individuals. Knowing the provenance of your data is essential, especially when using large complex systems that process multiple datasets.
The underlying issues of this simple question motivated us to improve the provenance data in the Open PHACTS project. We developed a guideline for dataset descriptions where the metadata is carried with the data. In this talk I will highlight the challenges we faced and give an overview of our metadata guidelines.
Presentation given to the W3C Semantic Web for Health Care and Life Sciences Interest Group on 14 January 2013.
Essentials of Automations: Optimizing FME Workflows with Parameters – Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
After several years of existence, an extremely active community, and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality – Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic* – Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath Test Automation using UiPath Test Suite series, part 4 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast!?! @KCD Istanbul 2024 – Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need in order to apply AI to our own infrastructure and get it working from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 – Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Connector Corner: Automate dynamic content and events by pushing a button – DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Validata: A tool for testing profile conformance
1. Validata: A tool for testing profile conformance
Alasdair J G Gray
Heriot-Watt University
www.macs.hw.ac.uk/~ajg33
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Andrew Beveridge
Jacob Baungard Hansen
Johnny Val
Leif Gehrmann
Roisin Farmer
Sunil Khutan
Tomas Robertson
2. HCLS Dataset Descriptions
https://www.w3.org/TR/hcls-dataset/
Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331 https://doi.org/10.7717/peerj.2331
3. Requirements
• Online tool
  – Deployable on W3C server
  – GUI
  – API
• Support multiple constraints
  – Properties
  – Data values
  – …
• Requirement levels
  – Different levels of user messages: Error, Warning, Information
• Configurable
  – HCLS (Required)
  – DCAT, Open PHACTS, etc. (Optional)
4. Example Constraint
• Shape
• A Dataset
  – MUST be declared to be of type dctype:Dataset
  – MUST have a dcterms:title as a language-typed string
  – MUST NOT have a dcterms:created date
[Slide figure: the <Dataset> shape as a graph, with dcterms:title required to be an rdf:langString and the dcterms:created arc marked ✗]
Note: dates are associated with versions in HCLS.
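The constraint described on this slide can be written down as a Shape Expression. Below is a minimal sketch in ShExC (ShEx 2 compact syntax, which post-dates the ShEx dialect Validata originally consumed); the shape name and prefix bindings are assumptions made for illustration.

PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX dctype: <http://purl.org/dc/dcmitype/>

<DatasetShape> {
  rdf:type [dctype:Dataset] ;   # MUST be declared to be of type dctype:Dataset
  dct:title rdf:langString ;    # MUST have a language-typed (rdf:langString) title
  dct:created . {0}             # MUST NOT have a creation date (dates belong to versions)
}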
5. Example Validation
• Shape
• Data
[Slide figure: the <Dataset> shape from the previous slide matched step by step against example data; ✗ marks the triple that violates the shape]
6.–7. Example Validation (further build steps of the same shape/data walk-through)
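For a toy illustration of what the validation on these slides checks, here is some hand-made Turtle data: the first resource satisfies the shape above, while the second would be reported as a violation because its title lacks a language tag and it carries a dcterms:created date. The example.org IRIs are invented for illustration.

@prefix dct:    <http://purl.org/dc/terms/> .
@prefix dctype: <http://purl.org/dc/dcmitype/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# Conforms: correct type, language-tagged title, no creation date.
<http://example.org/good-dataset>
    a dctype:Dataset ;
    dct:title "An example dataset"@en .

# Fails: plain-literal title and a prohibited dct:created date.
<http://example.org/bad-dataset>
    a dctype:Dataset ;
    dct:title "Another dataset" ;
    dct:created "2016-12-01"^^xsd:date .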
13. Validata
https://github.com/HW-SWeL/Validata
• RDF constraint validation tool
  – Configurable to any profile
• Shape Expression (ShEx) constraints
• Open source JavaScript implementation
www.macs.hw.ac.uk/~ajg33/
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Editor's Notes
Motivation: how do we check descriptions conform?
Summary level: time unchanging information, e.g. name, description, publisher
Version level: version specific information, e.g. version number, creator, etc
Distribution level: file specific information, e.g. file location and format, number of triples
18 vocabularies: DCTerms, DCAT, VoID, FOAF, …
61 prescribed properties: MUST, SHOULD, MAY, MUST NOT for each level
Link into data publishing pipeline via API
Not tied to HCLS, only a motivation
No existing tool meets these needs
Constraints form a graph pattern that data must comply with
How do we validate that our example data conforms to a certain shape
Express expected shape as ShEx
Toy example, what about for real
ShEx: concise notation, regex based. W3C SHACL was not stable when this work was done; ShEx covers the same ground as SHACL, with some extra features.
Step through validation process
Extended ShEx to allow arbitrary hierarchies
Toy example, what about for real
ShEx-validator has other dependencies too:
Minimist: argument parsing
Promise: callbacks
PEG.js: parser generator
Mocha: test-driven development