An Introduction to CCDH
Joint meeting of the CRDC & the Center for Cancer Data Harmonization
Date: June 29, 2020
https://datascience.cancer.gov/data-commons/center-cancer-data-harmonization-ccdh
These slides: bit.ly/ccdh-crdc-june-2
Outline
● Synthesis of information from CRDC and insights derived | Sam, Melissa
● Presentation of Harmonized Data Model | Brian & Matt
● Ontology landscape and terminological requirements | Jim, Harold, Dazhi
Community Development (Lead: Volchenboum; Co-Lead: Vasilevsky)
Data Model harmonization (Lead: Chute; Co-Lead: Furner)
Ontology & Terminology Ecosystem (Lead: Solbrig)
Tools & Data Quality (Lead: Balhoff)
Programmatic oversight
CBIIT: Sherri De Coronado, Allen Dearry
FNL: Todd Pihl, Resham Kulkarni
Program Management and operations (Lead: Haendel; Co-Lead: Munoz-Torres)
Role of CCDH in the CRDC ecosystem
Facilitate retrospective and prospective semantic harmonization of data across nodes of the CRDC
Coordinate the community to ensure quality “fit for purpose” design and implementation of standards that will facilitate interoperability of heterogeneous data types and CRDC resources
Find agreement across the communities built around CRDC
- match and extend data models
- annotation, harmonization
- quality assurance
Data Model harmonization (Lead: Chute; Co-Lead: Furner)
Ontology & Terminology Ecosystem (Lead: Solbrig)
Tools & Data Quality (Lead: Balhoff)
Schema to schema: OMOP to FHIR
Term to term: Oncotree to NCIt
Data records to data records: “Smoking status 3 packs per day” to NCIT:C154510 [Heavy Smoker]
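The record-level example above can be sketched in a few lines of Python. This is only an illustration: the function name, the regular expression, and especially the 1 pack/day threshold are assumptions made here, since the slide gives the target code but not the rule for what counts as "heavy".

```python
import re

# Code and label as shown on the slide.
HEAVY_SMOKER_CODE = "NCIT:C154510"
# Assumed threshold for illustration only; the slide does not define it.
HEAVY_SMOKER_MIN_PACKS_PER_DAY = 1.0


def map_smoking_status(raw):
    """Return an NCIt code for a free-text smoking status, or None if unmapped."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*packs? per day", raw, flags=re.IGNORECASE)
    if match is None:
        return None
    packs_per_day = float(match.group(1))
    if packs_per_day >= HEAVY_SMOKER_MIN_PACKS_PER_DAY:
        return HEAVY_SMOKER_CODE
    return None


result = map_smoking_status("Smoking status 3 packs per day")
print(result)
```

A production mapping would more likely rely on curated value sets and terminology services than on pattern matching. The sketch only shows the shape of a data-record-to-term transformation.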
Synthesis of information from CRDC and insights derived
Community Development Working Group
Goals:
● Engage CRDC stakeholders: interviews to identify and document semantic priorities
● Document current platforms
● Develop plans to support core semantic standards and concierge services
Completed interviews, by node:
DCF: Data Commons Framework - Infrastructure
HTAN: Human Tumor Atlas Network
ICDC: Integrated Canine Data Commons
IDC: Imaging Data Commons
GDC: Genomic Data Commons
PDC: Proteomics Data Commons
Future interviews, by node:
Gabriella Miller Kids First Data Resource Center
CDS: Cancer Data Services
Broad Institute FireCloud
Institute for Systems Biology
SevenBridges
NBIA: National Biomedical Imaging Archive
SEER Virtual Tissue Repository
CIDC: Cancer Immunology Data Commons
Summary matrix from initial interviews
Community Development - Phase II - Pilot
● Provision of help desk services (office hours and GitHub issue tracker)
● Data preparation services
○ mapping and transformations of terminologies and models
○ metadata validation
○ data annotation
● Web portal development
● Work with the nodes to assist mapping and transformation of data
● Develop user support documentation and materials
Main user base is the node developers, but these users will also benefit
Community Development - Phase III/IV - Production and Operations
● Establish a transparent process for community discussion, modification, and acceptance of new or modified content (GitHub)
● Concierge services for CRDC nodes, DCC, DCF, other end users
● Continue collecting user questions and feedback to improve services and identify user needs and pain points
● Enable the users to find the resources they need and to be able to use the portal independently
● Web portal enhancements / load testing
● Unit tests / QC
CCDH Harmonized Data Model
Data Model Harmonization: Overview
● Will provide a single data model that harmonizes syntax and semantics across the CRDC systems and services.
● This CRDC-H model will enable data aggregation and exchange to facilitate integrated search, navigation, and metadata-based analysis
● We will align with community standards where possible (e.g. FHIR, BRIDG) to promote broader interoperability, and leverage mappings and tools provided by these efforts
Ecosystem of CRDC repositories, services and stakeholders
CRDC-H Model Development Workflow
An iterative process through which source model content is evaluated, aggregated, mapped, and refactored into a standards-aligned and harmonized data model.
1. Standardize Source Data Model Documentation
2. Generate an Aggregated Data Model (ADM)
3. Map the ADM to Community Standard Data Models
4. Refactor the ADM into a Conceptual Domain Model (CDM)
5. Refactor the CDM to a Logical Data Model (CRDC-H)
From: abstract specification, low harmonization, not standards-aligned
To: concrete specification, deep harmonization, standards-aligned
Phase I: CDM Prototype Development
● Targeted four source models (GDC, PDC, ICDC, HTAN)
● Focused on Biospecimen and Administrative subdomains
● Harmonized entities and attributes, not data types or value sets/terminologies
● Informed by BRIDG and FHIR standards
● Produced an exploratory conceptual model (does not yet support implementation)
Lessons learned from this narrow but deep dive will inform subsequent iterations that incorporate new data sources and subdomains.
From: abstract specification, low harmonization, not standards-aligned
To: concrete specification, deep harmonization, standards-aligned
The Aggregated Data Model (ADM)
GDC: 26 entities, 561 attributes
PDC: 27 entities, 500 attributes
ICDC: 27 entities, 265 attributes
ADM: 55 entities, 984 attributes
The Aggregated Data Model (ADM)
A substrate for refactoring into more deeply harmonized models
Node models are not well aligned at the outset
● e.g. ICDC and GDC: ~30% entity equivalence, <5% attribute equivalence
Property aggregation in the ADM is based on superficial analysis and strict
aggregation criteria - so harmonization is minimal
● Only strictly equivalent elements within strictly equivalent entities are merged
Deeper aggregation and harmonization of elements will be achieved as the
ADM is refactored into the CDM
● Terminological - e.g. GDC 'Treatment' vs ICDC 'Agent Administration'
● Structural - e.g. ICDC provides a more normalized model for clinical metadata
● Semantic - e.g. harmonizing disease terminologies used across systems / species
● Precision - e.g. variable detail provided about tumor staging across models
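The strict aggregation rule described above (merge only strictly equivalent elements within strictly equivalent entities) can be sketched as follows. Exact name matching stands in for "strict equivalence", which is a simplification, and the entity and attribute names are toy inputs rather than excerpts from the real node dictionaries.

```python
from collections import defaultdict


def aggregate(sources):
    """Merge attributes only when entity and attribute names both match exactly.

    `sources` maps a node name to {entity: [attribute, ...]}.
    Returns {(entity, attribute): sorted list of contributing nodes}.
    """
    adm = defaultdict(set)
    for node, entities in sources.items():
        for entity, attributes in entities.items():
            for attribute in attributes:
                adm[(entity, attribute)].add(node)
    return {key: sorted(nodes) for key, nodes in adm.items()}


# Toy inputs for illustration only.
sources = {
    "GDC": {"sample": ["sample_type", "freezing_method"], "treatment": ["agent"]},
    "ICDC": {"sample": ["sample_type"], "agent_administration": ["agent"]},
}

adm = aggregate(sources)
# Elements contributed by more than one node are the only ones that got merged.
merged = {key: nodes for key, nodes in adm.items() if len(nodes) > 1}
print(merged)
```

In this toy run, 'treatment' and 'agent_administration' stay separate even though they describe similar things. That is the kind of terminological gap the ADM-to-CDM refactoring is meant to close.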
The Conceptual Domain Model (CDM) Prototype
High-Level Structural Changes Resulting from ADM Refactoring (Biospecimen Subdomain)
1. Specialization: Specialized specimen subtypes in the ADM get collapsed
2. Normalization: Data elements get distributed across a larger set of entities
3. Harmonization: Refactoring reduces total number of properties by half
ADM: 144 specimen properties in total
CDM (after refactoring): 74 specimen properties in total
Refactoring results in a much more normalized and deeply harmonized CDM
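As a quick check of the "by half" figure, the reduction can be computed from the two specimen property counts on the slide. The variable names are chosen here for illustration.

```python
adm_specimen_properties = 144  # count from the slide
cdm_specimen_properties = 74   # count from the slide

removed = adm_specimen_properties - cdm_specimen_properties
reduction_pct = round(100 * removed / adm_specimen_properties, 1)
print(f"{removed} fewer properties, a {reduction_pct}% reduction")
```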
The Conceptual Domain Model (CDM)
UML Diagram of CDM Entities and Attributes (link)
Entities in the CDM prototype, and the attributes held by each. Attribute count shown in parentheses.
CDM Data Dictionary (link)
● The CDM prototype is specified as a spreadsheet-based data dictionary
● Entities and their Attributes are each described in a separate sheet
● Cardinality of attributes is specified to be as permissive as possible initially
● Data Types are minimally specified
○ Simple: declared only at a high level (limited to literal, boolean)
○ Complex: proposals for Identifier, Coding, DateTime, Quantity, . . .
● A ‘Referenced Entities’ sheet lists entities that are referenced in CDM relationships,
but are not in scope to model in this phase of work.
○ e.g. Organization, Visit, ConditionDiagnosis
● A ‘Data Containers’ sheet holds placeholders for objects that will be defined to group
sets of related properties (specific structures for these t.b.d.)
● Mappings of several types are also provided in the main Entity sheets:
○ ADM attributes that map to each CDM attribute (column L)
○ Source node attributes aggregated by these ADM attributes (column M)
○ CDM to FHIR mappings (column N)
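One row of such a data dictionary could be represented in code roughly as below. This is a sketch under assumptions: the class name, field names, and example values are hypothetical and are not copied from the CDM spreadsheet. Only the kinds of information (cardinality, data type, and the three mapping columns) come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CdmAttribute:
    """One attribute row of a spreadsheet-style data dictionary (hypothetical layout)."""
    name: str
    data_type: str                    # e.g. "literal", "boolean", or a proposed complex type
    min_card: int = 0                 # permissive defaults: optional ...
    max_card: Optional[int] = None    # ... and unbounded (None means no upper limit)
    adm_attributes: List[str] = field(default_factory=list)     # ADM attributes mapped here
    source_attributes: List[str] = field(default_factory=list)  # node attributes behind them
    fhir_mappings: List[str] = field(default_factory=list)      # CDM to FHIR mappings

    def cardinality(self) -> str:
        upper = "*" if self.max_card is None else str(self.max_card)
        return f"{self.min_card}..{upper}"


# Illustrative values only; not taken from the actual dictionary.
specimen_type = CdmAttribute(
    name="specimen_type",
    data_type="Coding",
    adm_attributes=["Sample.sample_type"],
    fhir_mappings=["Specimen.type"],
)
print(specimen_type.name, specimen_type.cardinality())
```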
The Conceptual Domain Model (CDM)
The Conceptual Domain Model (CDM)
Excerpt from the ‘Specimen’ sheet of the CDM Data Dictionary (link)
Attribute Definitions Mappings
BRIDG (bridgmodel.nci.nih.gov)
● A detailed and highly-normalized conceptual model covering the domains of clinical and translational research (a mapping ‘hub’, not an implementation model)
● ADM mappings to BRIDG support a deeper understanding of source model elements, keep our data model grounded in reality, and enable cross-mapping to other BRIDG-mapped models
FHIR (hl7.org/fhir)
● A data exchange model and API framework covering patient-level healthcare information generated in EHRs
● ADM mappings to FHIR provide a pragmatic target to guide ADM->CDM refactoring, as alignment can enable interoperability with clinical data systems and potentially let us leverage FHIR infrastructure and tools
Mapping CCDH Models to BRIDG and FHIR
Mappings from Sources and the CDM to BRIDG and FHIR can be derived from ADM mappings to each of these models
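Deriving source-level mappings from ADM mappings amounts to composing two lookup tables. The sketch below assumes each source element maps to exactly one ADM element and each ADM element maps to a list of target elements. The element paths are illustrative placeholders, not entries from the CCDH mapping spreadsheets.

```python
def compose(source_to_adm, adm_to_target):
    """Derive source-to-target mappings by following source -> ADM -> target."""
    derived = {}
    for source_element, adm_element in source_to_adm.items():
        for target in adm_to_target.get(adm_element, []):
            derived.setdefault(source_element, []).append(target)
    return derived


# Illustrative placeholder mappings.
source_to_adm = {
    "GDC.sample.freezing_method": "ADM.Sample.freezing_method",
    "ICDC.sample.comment": "ADM.Sample.comment",
}
adm_to_fhir = {
    "ADM.Sample.freezing_method": ["Specimen.processing.procedure"],
}

derived = compose(source_to_adm, adm_to_fhir)
print(derived)
```

Source elements whose ADM element has no target mapping simply drop out of the result, which also gives a rough way to spot mapping gaps.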
BRIDG mapping path for ADM.Sample.freezing_method:
BiologicSpecimen <--beAFunctionPerformedBy-- Subject
<--beParticipatedInBy-- PerformedMaterialProcessStep.methodCode
WHERE PerformedMaterialProcessStep --instantiate-->
DefinedMaterialProcessStep.nameCode="freeze"
FHIR elements required to represent ADM Sample
Phase II Activities: Multiple Workstreams in Parallel
● Test / validate the CDM prototype against node data, competency questions, and feedback from stakeholders.
● Incorporate additional CRDC source models into the ADM (e.g. IDC) (Steps 1 and 2)
● Refactor additional ADM subdomains into the CDM (e.g. clinical metadata) (Steps 3 and 4)
● Evolve mature CDM content into an implementable logical model (Step 5)
● Terminological / value set harmonization
Key CCDH Modeling Work Products
May 2020 Deliverable Package
ID | Name | Description | Archived Document
WP0 | May 2020 Phase 1 Report | Short document describing work performed and products generated in this phase of work. | gdoc
WP1 | BRIDG and FHIR Mappings | A spreadsheet with detailed and provenanced mappings of ADM elements to BRIDG and FHIR | xls
WP2 | BRIDG and FHIR Covering Model Diagrams | UML-like views of elements in the BRIDG and FHIR models required to represent ADM entities. | pdf
WP3 | CDM Entity and Attribute Diagram | A class diagram providing a high-level view of the CDM | pdf
WP5 | CDM Dictionary (and Mappings) | A data dictionary spreadsheet detailing the Conceptual Domain Model, and its attribute-level mappings to the ADM and FHIR. | gsheets
WP6 | ADM Representation in FHIR | A representation of ADM entities using FHIR metamodeling language and tooling | gdoc
February 2020 Deliverable Package
ID | Name | Description | Archived Document
WP1 | CRDC Node concept maps | A side-by-side view of the core models implemented by GDC, PDC, and ICDC nodes. | png
WP3 | CRDC Data Model Dictionaries | One document with separate spreadsheets for GDC, PDC, and ICDC models. | gsheets
WP5 | Aggregated Model Concept Map | A high level view of the entities and relationships in the aggregated model. | png
WP6 | Aggregated Data Dictionary | Spreadsheets describing all elements from the Aggregated Model, and mappings to source elements. | gsheets
Ontology landscape and requirements for terminologies and tools
Delivering terminological & data model content to support data ingest / data harmonization within each node
● Provide tools to facilitate use of the harmonized data model and terminology by nodes
○ Harmonized data and terminologies enable access to data via CDA
● Metadata validation leveraging the harmonized terminology
● Mapping incoming datasets to the harmonized model
● Migration across harmonized model versions
● Leverage existing tools, existing terminologies, where possible
Behind every data model are the tools and terminologies that make it work
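Metadata validation against a harmonized terminology can be pictured as checking each coded field against a value set. The sketch below is only an illustration: the field names, value sets, and error format are assumptions, and a real service would presumably fetch value sets from a terminology server rather than hard-code them.

```python
def validate_record(record, value_sets):
    """Return a list of error strings; an empty list means the record passed."""
    errors = []
    for field_name, allowed in value_sets.items():
        value = record.get(field_name)
        if value is None:
            errors.append(f"{field_name}: missing")
        elif value not in allowed:
            errors.append(f"{field_name}: '{value}' not in value set")
    return errors


# Hypothetical value sets for illustration.
value_sets = {
    "sample_type": {"tissue", "blood", "cell line"},
    "preservation_method": {"frozen", "FFPE", "fresh"},
}

good = validate_record({"sample_type": "blood", "preservation_method": "frozen"}, value_sets)
bad = validate_record({"sample_type": "bone"}, value_sets)
print(good, bad)
```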
Terminology tools and services landscape assessment
What already exists? What can be best utilized or adapted for the CRDC? What are the gaps?
Admin/Access
Licensing
Registration
Authentication
Publication
Version management
Change management
Automated updates
UI/Browse/Search
Term search/Autocomplete
UI for navigation
Querying, filtering
Synonym support
Visualization
Community use indicated/tracked
API
Standard
Named entity recognition
Validation
Transitive closure
Identifiers
URIs
Dereferencing
Mapping
Serves maps
Map curation and authoring
Map validation
Value set services
Formats
Semantic typing
Inputs, outputs, OWL2, etc.
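Of the API features listed above, transitive closure is the one that most benefits from a concrete picture: given direct is-a links, it answers "which terms are ancestors of this one?". The sketch below assumes an acyclic hierarchy and uses a toy set of labels, not actual NCIt content.

```python
def transitive_closure(parents):
    """Map each term to the set of all its ancestors.

    `parents` maps a term to its direct parents; the hierarchy is assumed acyclic.
    """
    closure = {}

    def ancestors(term):
        if term not in closure:
            result = set()
            for parent in parents.get(term, ()):
                result.add(parent)
                result |= ancestors(parent)
            closure[term] = result
        return closure[term]

    for term in parents:
        ancestors(term)
    return closure


# Toy hierarchy for illustration only.
parents = {
    "lung adenocarcinoma": ["lung carcinoma", "adenocarcinoma"],
    "lung carcinoma": ["carcinoma"],
    "adenocarcinoma": ["carcinoma"],
}

closure = transitive_closure(parents)
print(sorted(closure["lung adenocarcinoma"]))
```

With the closure precomputed, a search for "carcinoma" can also return records annotated with any of its descendants, which is one reason terminology services often expose it.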
Data annotation and QC tools
What already exists? What can be best utilized or adapted for the CRDC? What are the gaps?
Mapping and Transformation
standardization
NLP/named entity recognition
semantic similarity
Metadata Validation and QC
value sets
logical constraints
syntax
Data Annotation
template building
term search
terminology browsing
Examples
CEDAR
Ptolemy.V
Metadata Validation Service
Simple Terminology Server
FHIR Terminology Server
OpenRefine
RDF shapes (ShEx/SHACL)
ISO 11179 Metadata Registries (MDR)
● Provenance / history
● Contacts / managing organization
● Semantics - what the elements in a data model represent
ISO 11179-3 - registry metamodel and basic attributes carry a standard model of
“binding” -- how one associates ontology meaning with both the data element itself
and its content.
Standard for recording information about data models
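The idea of "binding" can be sketched with a few small classes: one concept code attached to the data element itself (what the element means) and one attached to each permissible value (what its content means). The class and field names below loosely follow ISO 11179 vocabulary but are a simplification made for illustration. The `EX:` codes are placeholders, and only NCIT:C154510 [Heavy Smoker] comes from this deck.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataElementConcept:
    object_class: str     # the thing described, e.g. "Person"
    property_name: str    # the characteristic described, e.g. "Smoking Status"
    concept_code: str     # ontology binding for the element's meaning


@dataclass(frozen=True)
class PermissibleValue:
    value: str            # the value as stored in the data
    meaning_code: str     # ontology binding for the value's meaning


@dataclass(frozen=True)
class DataElement:
    name: str
    concept: DataElementConcept
    permissible_values: Tuple[PermissibleValue, ...]

    def meaning_of(self, value: str) -> Optional[str]:
        """Return the concept code bound to a stored value, if any."""
        for permissible in self.permissible_values:
            if permissible.value == value:
                return permissible.meaning_code
        return None


smoking_status = DataElement(
    name="smoking_status",
    concept=DataElementConcept("Person", "Smoking Status", "EX:0001"),  # placeholder code
    permissible_values=(
        PermissibleValue("heavy", "NCIT:C154510"),  # code from the earlier slide
        PermissibleValue("never", "EX:0002"),       # placeholder code
    ),
)
print(smoking_status.meaning_of("heavy"))
```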
ISO 11179 Model of meaning / Model of representation
Current roles in caDSR + NCI Thesaurus
RDF as the great “blender”
ADM Models Represented using FHIR Metamodel, and generated documentation
https://fhir.hotecosystem.org/ccdh/fhir/, https://fhir.hotecosystem.org/ccdh/fhir/aliquot.html
FHIR as a Modeling Framework
FHIR into the RDF blender
RDF blender to FHIR
Putting it all together
Model in Google Sheets
Putting it all together -- do we need a unifying representation?
Model in Google Sheets
Acknowledgments
Center for Biomedical Informatics & Information Technology
● Allen Dearry
● Sherri de Coronado
● Erika Kim
● Denise Warzel
● Melissa Cook
Samvit Solutions
● Smita Hastak
● Wendy Ver Hoef
● Charles Yaghmour
Frederick National Laboratory for Cancer Research
● Todd Pihl
● Resham Kulkarni
DCF: Data Commons Framework - Infrastructure
HTAN: Human Tumor Atlas Network
ICDC: Integrated Canine Data Commons
IDC: Imaging Data Commons
GDC: Genomic Data Commons
PDC: Proteomics Data Commons
SevenBridges
Gabriella Miller Kids First Data Resource Center
CDS: Cancer Data Services
Broad Institute FireCloud
Institute for Systems Biology
NBIA: National Biomedical Imaging Archive
SEER Virtual Tissue Repository
CIDC: Cancer Immunology Data Commons
Cancer Data Aggregator
● Brian O’Connor
● Alex Baumann
● David Pot
● Jack DiGiovanna
● Cara Mason
