
CDISC EU Interchange 2012

Semantic Models for CDISC Based Standards and Metadata Management

Introduction

We may have arrived at a critical turning point in the way clinical data can be managed, used and reused within and across organizations. The coverage and maturity of existing CDISC standards, the establishment of these standards within the industry at large, the use of these standards as a foundation for metadata driven systems, and the upcoming role of semantic standards are all converging to create new and unique opportunities. In this presentation we look at the implications and challenges of integrating CDISC standards, metadata, and information models into a single framework. We also show how semantic standards can provide a solid foundation in building such a framework.

CDISC Standards

The role of data standards for the management of clinical data has shifted significantly over the past few years, largely due to the establishment of CDISC standards across the pharmaceutical industry. Not so long ago, sponsors had to consider if and when they should use SDTM standards for FDA submissions. Today, those questions have changed. Not if and when, but how to best adopt CDISC based data standards is becoming the leading question. This change in mindset is in itself a major step forward, but it also leads to formidable challenges: for CDISC as the owner of the standards, for sponsors integrating these standards into their own organizations, for vendors providing products and services, and for regulatory organizations reviewing submitted data.

A key challenge for any set of standards is to be consistent and complete. Looking at the CDISC standards, we see a variety of standards at different levels of maturity. The SDTM standards, domains and terminology seem to have the highest level of adoption to date, but as more sponsors submit data according to those standards, their shortcomings become magnified. SDTM is an informal model and in many instances open to interpretation.
This leads to inconsistencies in how collected data is mapped to SDTM, potentially across studies from a single sponsor, but definitely across studies from different sponsors. As sponsors get comfortable adopting the SDTM standards, they naturally venture into the CDASH and ADaM standards. These standards have had a shorter lifetime and have not yet reached the maturity level of SDTM, while suffering from similar problems. In addition, issues of consistency at the content and representational levels across the CDISC standards come into focus as well. This is highlighted by the disconnect between the standards just mentioned and the BRIDG model, a comprehensive domain analysis model for protocol-driven biomedical and clinical research, captured as a UML model.

Sponsors adopting CDISC have to deal with these issues. They also face the challenge of managing and integrating CDISC based data standards within their respective organizations at the information architecture, process, and systems application levels. In the following sections we outline some fundamental principles that can help meet these challenges.

Information Architecture

We already indicated the importance for a set of standards to be complete and consistent. Formal models make these notions precise. Another observation is that the content of the CDISC standards depends on the meaning of what is studied in the biological and clinical reality (often referred to as concepts), and on how these concepts are represented by data elements from protocol to submission, i.e. we are dealing with semantic and metadata information about biomedical and clinical research knowledge and data. The conclusion is immediate and striking: an information architecture taking this into account needs to be based on a formal ontological metadata model.

Well placed to get the job done are semantic models based on the W3C semantic web standards (RDF, OWL, SKOS). These standards provide the means to define a formal representation of a body of knowledge.
In short, the Resource Description Framework (RDF) specifies a general model of how any piece of knowledge can be represented by statements of the form Subject-Predicate-Object or Subject-Predicate-Value, called triples. Each part of a triple (except Value) has a Uniform Resource Identifier (URI), and triples can be aggregated into graphs with subjects and objects as nodes, and predicates as arcs. The Web Ontology Language (OWL) adds a typing mechanism to classify subjects and objects into a hierarchy of classes and defines modeling constructs to express knowledge about predicates. This gives a rich modeling vocabulary to build schemas and the capability to derive new triples from existing triples (inference). Finally, the Simple Knowledge Organization System (SKOS) is a thin RDF based vocabulary that can be used to build terminologies. See [2] for more information on RDF based standards.

A knowledge base written in RDF can easily be shared between systems by serializing it into formats such as RDF/XML. RDF knowledge bases are also easy to federate and cross-reference, as witnessed by the development of the Linked Open Data (LOD) cloud, a large collection of open and cross-linked RDF data sets available on the web today. In this context it should be noted that an OWL version of the NCI Thesaurus (the source for CDISC's controlled terminologies) is freely available today in RDF/XML format. Also, an effort is well under way to port the BRIDG UML model to an OWL based ontology.

Looking across the CDISC standards, we notice that the content is itself metadata, hence the RDF schema we have in mind corresponds to a level 3 meta-model. A good starting point here is the ISO 11179 standard for metadata registries (MDR). This standard is a bit elaborate and not that widely adopted, but it does provide a sound basis for developing a small and generic OWL vocabulary for metadata models, including most notably the capability of item level versioning for anything that goes into a metadata registry. Using an ISO 11179 based OWL vocabulary, it is fairly straightforward to create a knowledge base for the CDASH, SDTM, and ADaM standards.

Finally, there is a need to eliminate any room for interpretation and to guarantee consistency between the different CDISC standards. A biomedical concept model, representing the meaning of what is studied in the biological and clinical reality, can provide the glue to hold everything together. It provides common and precise semantic content for any CDASH, SDTM, and ADaM data element, and restricts these standards to have only representational capabilities. On the other side of the coin, an RDF based biomedical concept model can link directly into other RDF sources with semantic content, such as the NCI Thesaurus, and BRIDG once its OWL representation is available.

Our considerations on an information architecture for CDISC standards based on semantic web standards lead to the following RDF based information stack.

[Figure 1: RDF based information stack. Layers as extracted, top first: Sponsor Extensions; CDASH, SDTM, ADaM; Biomedical Concept Model; ISO 11179 MDR Schema (subset); BRIDG and ISO 21090; NCI Thesaurus; RDF, OWL, SKOS.]
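
The triple model, the class hierarchy with inference, and the role of a biomedical concept as glue between data elements can be illustrated with a minimal sketch in plain Python. It uses no RDF library, abbreviates URIs to prefixed names, and applies only two RDFS-style rules that OWL reasoners also support. All "ex:" names (classes, data elements, the "ex:represents" predicate) are hypothetical placeholders chosen for illustration, not identifiers from CDISC, ISO 11179, or the authors' model.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# A graph is a set of (subject, predicate, object) triples. The "ex:" names
# are hypothetical placeholders, not actual CDISC or ISO 11179 identifiers.
graph = {
    ("ex:DataElement", SUBCLASS_OF, "ex:AdministeredItem"),
    ("ex:CollectionElement", SUBCLASS_OF, "ex:DataElement"),
    ("ex:TabulationElement", SUBCLASS_OF, "ex:DataElement"),
    ("ex:SYSBP_CRF", RDF_TYPE, "ex:CollectionElement"),
    ("ex:SYSBP_TAB", RDF_TYPE, "ex:TabulationElement"),
    # Both elements point at the same concept, which acts as the glue.
    ("ex:SYSBP_CRF", "ex:represents", "ex:SystolicBloodPressure"),
    ("ex:SYSBP_TAB", "ex:represents", "ex:SystolicBloodPressure"),
}


def infer(triples):
    """Apply two RDFS-style rules until no new triples appear."""
    closed = set(triples)
    while True:
        subclass = [(s, o) for s, p, o in closed if p == SUBCLASS_OF]
        typed = [(s, o) for s, p, o in closed if p == RDF_TYPE]
        new = set()
        for sub, sup in subclass:
            # Rule 1: subClassOf is transitive.
            new |= {(sub, SUBCLASS_OF, top) for mid, top in subclass if mid == sup}
            # Rule 2: an instance of a subclass is an instance of the superclass.
            new |= {(inst, RDF_TYPE, sup) for inst, cls in typed if cls == sub}
        if new <= closed:
            return closed
        closed |= new


def elements_for(triples, concept):
    """Data elements that represent the given concept, in sorted order."""
    return sorted(s for s, p, o in triples if p == "ex:represents" and o == concept)


closed = infer(graph)
print(("ex:SYSBP_TAB", RDF_TYPE, "ex:AdministeredItem") in closed)  # True
print(elements_for(closed, "ex:SystolicBloodPressure"))
# ['ex:SYSBP_CRF', 'ex:SYSBP_TAB']
```

The derived triple stating that the tabulation element is an administered item was never asserted; it follows from the class hierarchy. In an ISO 11179 based vocabulary this is the kind of derivation that would let registry features such as versioning apply to every kind of item.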
Notice that the top layer offers sponsors the opportunity to extend content based on existing RDF schemas, e.g. sponsors may add SDTM data elements as supplemental qualifiers, or introduce additional RDF schemas to cover new types of content.

CDISC Considerations

The CDISC standards have come a long way, both in terms of maturity and adoption, but they also face considerable challenges as more sponsors use the standards, and even more so as substantial content is expected to be added for therapeutic areas. A layered information architecture based on semantic standards can provide a solid foundation to systematically address these challenges.

The CDISC SHARE project may be the best place to get such an effort under way, but it will require substantial commitment from CDISC as a whole to be successful. Just recently we have provided a first draft OWL model to give a home to the ideas that the SHARE team has been working on over the past few years. The future roadmap, however, seems to be unclear at best, with no firm commitment to implementation goals and timelines. At the same time, the SHARE team is already producing much valuable content that fits extremely well in the biomedical concept model.

Sponsor and Vendor Considerations

Right now we seem to have arrived at a turning point, driven by a widespread adoption of CDISC standards and an emerging need for sponsors to establish a standards management function within their respective organizations. Large organizations have increasing difficulty just dealing with the resulting workload of managing and applying clinical data standards. This naturally leads to the need for a metadata repository (MDR).

The same arguments for the information architecture given earlier apply even more here. RDF/XML provides an RDF interface format for MDR content. As indicated before, it can easily be shared and federated, but also loaded into a triple store database.
Since an RDF knowledge base can carry its own schema and everything is represented by triples, the triple store load is immediate and the RDF knowledge base directly represents the MDR content.

Two examples show how sponsors have started to implement semantic standards and apply linked data principles. At Roche this is done by implementing an internally built MDR; more details are given below. At AstraZeneca the requirements for a commercial MDR product will include an interface to MDR content based on semantic standards and linked data principles. This is part of a larger effort called integrative informatics (i2), which establishes the components to let a Linked Data cloud grow across AstraZeneca R&D.
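
The remark that a triple store load is immediate, because schema and content are both just triples, can be sketched as follows. This is an illustration under stated assumptions, not the Roche or AstraZeneca implementation: for brevity it swaps RDF/XML for a simplified line format modeled on N-Triples (no escaping, datatypes, or blank nodes), the `TripleStore` class is a hypothetical in-memory stand-in for a triple store database, and the `http://example.org/mdr/` names are invented.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A plain literal (the Value position of a triple)."""
    text: str


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/mdr/"  # hypothetical namespace


def to_line(triple):
    """Serialize one triple; URIs go in <>, literals in double quotes."""
    s, p, o = triple
    obj = f'"{o.text}"' if isinstance(o, Literal) else f"<{o}>"
    return f"<{s}> <{p}> {obj} ."


def from_line(line):
    """Parse one line written by to_line (simplified, no escape handling)."""
    body = line.strip()
    if not body.endswith(" ."):
        raise ValueError(f"not a triple line: {line!r}")
    s, p, o = body[:-2].split(" ", 2)
    obj = Literal(o[1:-1]) if o.startswith('"') else o[1:-1]
    return (s[1:-1], p[1:-1], obj)


class TripleStore:
    """Hypothetical in-memory store: loading is a set union of triples."""

    def __init__(self):
        self.triples = set()

    def load(self, text):
        """Add all triples in text and return how many were new."""
        before = len(self.triples)
        self.triples |= {from_line(ln) for ln in text.splitlines() if ln.strip()}
        return len(self.triples) - before


# Schema and content are both triples, so they travel in one export.
schema = {(EX + "DataElement", SUBCLASS_OF, EX + "AdministeredItem")}
content = {
    (EX + "SYSBP", RDF_TYPE, EX + "DataElement"),
    (EX + "SYSBP", EX + "version", Literal("1.0")),
}
export = "\n".join(sorted(to_line(t) for t in schema | content))

store = TripleStore()
added = store.load(export)
print(added)               # 3
print(store.load(export))  # 0, reloading the same export adds nothing
```

Because the store is a set of triples, federating a second export is the same `load` call, and statements that appear in both sources collapse into one.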
MDR Based Standards Implementation at Roche

In a first phase, Roche has successfully defined a set of clinical trial data standards based on the CDISC, ISO 11179 MDR, and W3C semantic standards, following the architecture shown earlier in Figure 1. In this implementation, the biomedical concept model has deliberately been designed as a thin layer, in anticipation that CDISC SHARE will deliver this part of the stack later on. BRIDG can be added as soon as its OWL representation becomes available. The data collection and data tabulation standards cover all of safety and the Roche therapeutic areas, but are only partially based on CDASH. Data analysis standards are still in their infancy.

In a second phase, in 2011, Roche built an MDR and an application infrastructure. This includes a controlled mechanism to publish the RDF stack to a triple store database, a web browser application to deliver the content to end-users, and a set of web services to provide access to other applications. The MDR includes item level versioning following ISO 11179 and is deployed in a high availability IT production environment. The next release is scheduled to include semantic search and linking from the biomedical concept model into the NCI Thesaurus. The good news for sponsors is that semantic technology has proven to work at all levels, from W3C standards to semantic toolsets such as modeling workbenches, triple store databases, and application programming interfaces (API).

Roche is now entering a third phase to establish MDR driven workflow automation from protocol to submission. The idea is to implement a semantic representation of the protocol and data analysis plan, and from there use the MDR content to support study build, provide data transformation services to derive SDTM mappings, and finally support the production of data analysis and submission deliverables.

References

1. To read more on knowledge systems and semantic modeling, the following is recommended.
   • Dean Allemang and Jim Hendler. Semantic Web for the Working Ontologist. Second Edition. Morgan Kaufmann, 2011. This is an excellent book, well-written, specifically on the modeling aspects of RDF and OWL.
   • Christopher Walton. Agency and the Semantic Web. Oxford University Press, 2007. This book gives a broad outlook on knowledge systems and the semantic web, including more academic background on the computational aspects of the subject.
   • Dragan Gasevic, Dragan Djuric, and Vladan Devedzic. Model Driven Engineering and Ontology Management. Second Edition. Springer, 2009. This book provides valuable insight on knowledge engineering and the relationship between the different modeling spaces.
2. Here is a good entry page to locate the W3C standards for the semantic web, in particular the RDF, RDFS, OWL, and SKOS standards: http://www.w3.org/2001/sw/wiki/Main_Page
3. To see what the National Cancer Institute (NCI) is doing in the area of controlled terminologies and ontology modeling, have a look here: https://cabig.nci.nih.gov/concepts/EVS/
4. The National Center for Biomedical Ontology (NCBO) is a great resource for biomedical ontologies and related technologies. It can be accessed here: http://www.bioontology.org/