CDISC2RDF
Making clinical data standards linkable, computable and queryable
The CDISC2RDF initiative exploits Semantic
Web standards and Linked Data principles for
clinical data standards from CDISC (Clinical
Data Interchange Standards Consortium).
Introduction
Clinical data standards have been identified as one of five
initial areas by TransCelerate BioPharma, the non-profit
organization formed by ten leading pharmaceutical companies
to accelerate the development of new medicines.
The European Medicines Agency (EMA) is developing a policy
on the proactive publication of clinical-trial data in the interests
of public health, including clear and understandable clinical
data formats. The FDA has a long-held goal of making better
use of submitted clinical trial data. Pharmaceutical companies
have attempted to use submission standards to create study
repositories.
Exploiting Semantic Web technologies stands to simplify the
interpretation of individual studies, and improve cross-study
integration.
Kerstin Forsberg, Informatics Scientist
kerstin.l.forsberg@astrazeneca.com
Analysis, Informatics & Knowledge Engineering Practice, AstraZeneca, Sweden
CDISC2RDF Schemas
The first version of the core CDISC2RDF schemas was
intentionally developed to represent a minimal part of the
ISO 11179 model for metadata registries.
The Meta Model Schema (mms) represents the core Data
Description part of the ISO 11179 model, Part 3: Registry
metamodel and basic attributes.
From human-readable documentation and “text strings”
In the domain of clinical research, CDISC, a non-profit
organization, has developed standards for study design
(SDM), study data collection (CDASH), study data analysis
(ADaM), and submission to the regulatory bodies (SDTM).
These represent a limited set of data elements with names
such as “RACE” that also have a value set derived from the NCI
Thesaurus. However, most of the data elements are
containers, either for contextual variables with names such as
“VSDATE” and “AEACN” (date of measurement of vital signs and
action taken for adverse events) or for the results of
measurements. The measurements themselves are only
indirectly indicated, in variables called “TESTCD”, by a term,
or rather a text string, such as “DIABP”, “BMI” or “HGB”. Each
string represents a measurement procedure and is listed in
the so-called controlled terminologies (CT) for SDTM (Study
Data Tabulation Model).
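To illustrate the point about text strings, here is a minimal Python sketch. The record layout, the field names "TESTCD" and "RESULT", the result values, and the small lookup table are illustrative assumptions and are not taken from the CDISC files; only the three codes come from the text above.

```python
# Illustrative only: result records in which the measurement is identified
# solely by a text string in a "TESTCD" field.
records = [
    {"TESTCD": "DIABP", "RESULT": "80"},
    {"TESTCD": "BMI", "RESULT": "24.5"},
    {"TESTCD": "XYZ", "RESULT": "1"},  # a string with no entry in the terminology
]

# Hypothetical excerpt of a controlled terminology, keyed by the bare string.
controlled_terms = {
    "DIABP": "Diastolic Blood Pressure",
    "BMI": "Body Mass Index",
    "HGB": "Hemoglobin",
}


def describe(record):
    """Return a readable description, or None if the code is unknown."""
    meaning = controlled_terms.get(record["TESTCD"])
    if meaning is None:
        return None
    return f"{meaning}: {record['RESULT']}"


descriptions = [describe(r) for r in records]
```

Nothing in the data itself says what “DIABP” means. The meaning lives in a separate table that every consumer has to locate and join by string matching, which is the gap that identifiers in the form of URIs are meant to close.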
Today, all data standards and controlled terminologies are
published by CDISC and NCI EVS as PDFs, Excel files, and
traditional XML.
Human-readable documentation in PDFs and Excel files (and some in XML)
CDISC2RDF Schemas (based on the core of ISO 11179)
Machine-processable linked data structured as RDF triples
Meta model schema (mms): data definition, the core part of ISO 11179
Controlled Terminology schema (cts): a few additional properties from the NCI Thesaurus export
SDTM 1.2 schema (sdtms): classifiers (Data Element roles and types)
SDTM 3.1.2 IG schema (sdtmigs): a few additional properties
To machine-processable RDF triples and “URIs”
The first deliverable from the CDISC2RDF project was
published in early 2013. It contained OWL/RDF files (triples) for
the CDISC submission standards: SDTM 1.2, Implementation
Guide (IG) 3.1.2 and Controlled Terminology (CT), plus
CTs for the data capture standards (CDASH) and analysis
standards (ADaM).
Each data element (column), dataset, code list, classifier, etc.
has been assigned a URI (Uniform Resource Identifier).
The SDTM schema (sdtms), version 1.2, defines additional
classifiers in the underlying model, such as the data
element role Record Qualifier and the classifier for
Expected variables.
The Controlled Terminology schema (cts) adds to the
meta model schema (mms) a few additional
classifications and properties to represent the existing NCI
Thesaurus EVS export.
The classes and properties are used to annotate the
Excel column headers, and the standard import
functionality in the TopBraid Composer tool has been
used to create the RDF triples in XML, Turtle, and JSON
formats.
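The annotation step can be pictured with a small Python sketch. This is not how TopBraid Composer works internally; the spreadsheet row, the column headers, the cell values, the property URIs under http://example.org/, and the helper row_to_triples are all hypothetical and only show the general idea of mapping annotated column headers to RDF properties. The output is written as N-Triples lines because that format needs no library.

```python
# Hypothetical namespace for this sketch; not a CDISC2RDF namespace.
EX = "http://example.org/schema#"

# Assumed annotation: each spreadsheet column header is mapped to a property URI.
header_to_property = {
    "Variable Name": EX + "dataElementName",
    "Variable Label": EX + "dataElementLabel",
}


def row_to_triples(subject_uri, row):
    """Turn one spreadsheet row (header -> cell text) into N-Triples lines."""
    lines = []
    for header, value in row.items():
        prop = header_to_property.get(header)
        if prop is None:
            continue  # columns without an annotation are skipped
        # Minimal escaping of backslashes and double quotes in the literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject_uri}> <{prop}> "{escaped}" .')
    return lines


# Illustrative row; the cell values are made up for the example.
row = {"Variable Name": "AEACN", "Variable Label": "Action Taken", "Notes": "ignored"}
triples = row_to_triples(
    "http://rdf.cdisc.org/sdtmig-3-1-2/std#Column.AE.AEACN", row
)
```

Each annotated cell becomes one subject-predicate-object statement about the row's URI, while the unannotated "Notes" column produces nothing.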
CDISC2RDF started as a cross-pharma pre-competitive
project with AstraZeneca, Roche,
TopQuadrant, Free University of Amsterdam
and W3C HCLS to showcase the use of
Semantic Web standards and Linked Data
principles.
It is now incorporated in the Semantic
Technology project, part of the FDA/PhUSE
working group on Emerging Technologies, with
representatives from FDA, CDISC, pharmaceutical
companies, CROs and software vendors.
We want to feed this work back to CDISC and NCI, and to other public and internal
standards groups, and show in practice how to “Use (semantic web) standards for standards”.
Example URIs:
http://rdf.cdisc.org/sdtmig-3-1-2/std#Column.AE.AEACN
http://rdf.cdisc.org/sdtmig-3-1-2/std#Table.AE
http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.RecordQualifier
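The three example URIs above can be taken apart with the Python standard library, as in the short sketch below. Reading the fragment as a kind followed by dot-separated name parts (for example Column, then dataset AE, then variable AEACN) is an assumption inferred from these three examples, not a documented rule, and the helper split_uri is hypothetical.

```python
from urllib.parse import urldefrag

uris = [
    "http://rdf.cdisc.org/sdtmig-3-1-2/std#Column.AE.AEACN",
    "http://rdf.cdisc.org/sdtmig-3-1-2/std#Table.AE",
    "http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.RecordQualifier",
]


def split_uri(uri):
    """Split a URI into (namespace, kind, name parts) using '#' and '.'."""
    namespace, fragment = urldefrag(uri)
    kind, *parts = fragment.split(".")
    return namespace + "#", kind, parts


parsed = [split_uri(u) for u in uris]
```

Unlike a bare string such as “AEACN”, each URI carries the standard and version it belongs to in its namespace, so two URIs can be compared or linked without an external lookup table.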
All OWL/RDF files, schemas and standards
are available on https://code.google.com/p/cdisc2rdf/

CDISC2RDF poster for Conference on Data Integration in the Life Sciences 2013
