Introduction to CCDH Joint Meeting Recap
1. An Introduction to CCDH
Joint meeting of the CRDC & the Center for Cancer Data Harmonization
Date: June 29, 2020
https://datascience.cancer.gov/data-commons/center-cancer-data-harmonization-ccdh
These slides: bit.ly/ccdh-crdc-june-2
3. Outline
● Synthesis of information from CRDC and insights derived | Sam, Melissa
● Presentation of Harmonized Data Model | Brian & Matt
● Ontology landscape and terminological requirements | Jim, Harold, Dazhi
4. Community Development (Lead: Volchenboum; Co-Lead: Vasilevsky)
Data Model Harmonization (Lead: Chute; Co-Lead: Furner)
Ontology & Terminology Ecosystem (Lead: Solbrig)
Tools & Data Quality (Lead: Balhoff)
Program Management and Operations (Lead: Haendel; Co-Lead: Munoz-Torres)
Programmatic oversight
CBIIT: Sherri De Coronado, Allen Dearry
FNL: Todd Pihl, Resham Kulkarni
5. Role of CCDH in the CRDC ecosystem
Facilitate retrospective and prospective semantic harmonization of data across nodes of the CRDC
Coordinate the community to ensure quality “fit for purpose” design and implementation of standards that will facilitate interoperability of heterogeneous data types and CRDC resources
Find agreement across the communities built around CRDC
- match and extend data models
- annotation, harmonization
- quality assurance
6. Data Model Harmonization (Lead: Chute; Co-Lead: Furner)
Ontology & Terminology Ecosystem (Lead: Solbrig)
Tools & Data Quality (Lead: Balhoff)
Schema to schema: OMOP to FHIR
Term to term: Oncotree to NCIt
Data records to data records: “Smoking status 3 packs per day” to NCIT:C154510 [Heavy Smoker]
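The three mapping levels on this slide (schema, term, data record) can be illustrated with a minimal Python sketch. Only the code NCIT:C154510 [Heavy Smoker] comes from the slide. The field names, the source term code, and the packs-per-day threshold are hypothetical placeholders chosen for illustration, not CCDH mappings.

```python
import re

# Schema to schema: hypothetical source-to-target field names.
SCHEMA_MAP = {"smoking_status": "tobacco_use"}

# Term to term: hypothetical source code mapped to the NCIt code on the slide.
TERM_MAP = {"SRC:heavy_smoker": "NCIT:C154510"}

HEAVY_SMOKER = "NCIT:C154510"   # [Heavy Smoker], as given on the slide
ASSUMED_HEAVY_THRESHOLD = 1.0   # packs per day; illustrative assumption only


def map_schema(record):
    """Rename fields according to SCHEMA_MAP; unknown fields pass through."""
    return {SCHEMA_MAP.get(key, key): value for key, value in record.items()}


def map_term(code):
    """Translate a code if a term mapping exists; otherwise return it unchanged."""
    return TERM_MAP.get(code, code)


def annotate_smoking(text):
    """Map a free-text smoking value to a code, or None if no rule applies."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*packs?\s+per\s+day", text.lower())
    if match and float(match.group(1)) >= ASSUMED_HEAVY_THRESHOLD:
        return HEAVY_SMOKER
    return None


# Data record to data record: rename the field, then annotate the value.
record = map_schema({"smoking_status": "3 packs per day"})
record["tobacco_use_code"] = annotate_smoking(record["tobacco_use"])
```

Each level is kept as a separate function so that a schema mapping, a term mapping, and a record-level annotation rule can be curated and versioned independently.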
8. Community Development Working Group
Goals:
● Engage CRDC stakeholders: interviews to identify and document semantic priorities
● Document current platforms
● Develop plans to support core semantic standards and concierge services
Completed interviews
DCF: Data Commons Framework - Infrastructure
Node
HTAN: Human Tumor Atlas Network
ICDC: Integrated Canine Data Commons
IDC: Imaging Data Commons
GDC: Genomic Data Commons
PDC: Proteomics Data Commons
Future interviews
Gabriella Miller Kids First Data Resource Center
Node
CDS: Cancer Data Services
Broad Institute FireCloud
Institute for Systems Biology
SevenBridges
NBIA: National Biomedical Imaging Archive
SEER Virtual Tissue Repository
CIDC: Cancer Immunology Data Commons
Summary matrix from initial interviews
9. Community Development - Phase II - Pilot
● Provision of help desk services (office hours and GitHub issue tracker)
● Data preparation services
○ mapping and transformations of terminologies and models
○ metadata validation
○ data annotation
● Web portal development
● Work with the nodes to assist mapping and transformation of data
● Develop user support documentation and materials
Main user base is the node developers
But these users will also benefit
10. Community Development - Phase III/IV - Production and Operations
● Establish a transparent process for community discussion, modification, and acceptance of new or modified content (GitHub)
● Concierge services for CRDC nodes, DCC, DCF, other end users
● Continue collecting user questions and feedback to improve services and identify user needs and pain points
● Enable the users to find the resources they need and to be able to use the portal independently
● Web portal enhancements / load testing
● Unit tests / QC
12. Data Model Harmonization: Overview
● Will provide a single data model that harmonizes syntax and semantics across the CRDC systems and services.
● This CRDC-H model will enable data aggregation and exchange to facilitate integrated search, navigation, and metadata-based analysis
● We will align with community standards where possible (e.g. FHIR, BRIDG) to promote broader interoperability, and leverage mappings and tools provided by these efforts
Ecosystem of CRDC repositories, services and stakeholders
13. CRDC-H Model Development Workflow
An iterative process through which source model content is evaluated, aggregated, mapped, and refactored into a standards-aligned and harmonized data model.
1. Standardize Source Data Model Documentation
2. Generate an Aggregated Data Model (ADM)
3. Map the ADM to Community Standard Data Models
4. Refactor the ADM into a Conceptual Domain Model (CDM)
5. Refactor the CDM to a Logical Data Model (CRDC-H)
From: abstract specification, low harmonization, not standards-aligned
To: concrete specification, deep harmonization, standards-aligned
14. Phase I: CDM Prototype Development
● Targeted four source models (GDC, PDC, ICDC, HTAN)
● Focused on Biospecimen and Administrative subdomains
● Harmonized entities and attributes, not data types or value sets/terminologies
● Informed by BRIDG and FHIR standards
● Produced an exploratory conceptual model (does not yet support implementation)
Lessons learned from this narrow but deep dive will inform subsequent iterations that incorporate new data sources and subdomains.
From: abstract specification, low harmonization, not standards-aligned
To: concrete specification, deep harmonization, standards-aligned
15. The Aggregated Data Model (ADM)
GDC: 26 entities, 561 attributes
PDC: 27 entities, 500 attributes
ICDC: 27 entities, 265 attributes
ADM: 55 entities, 984 attributes
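As a worked check on these counts, the short Python sketch below sums the three source models shown on this slide and compares the totals with the ADM. It uses only the numbers above; the variable names are illustrative. The ADM retains roughly 69% of the summed entities and roughly 74% of the summed attributes, which suggests that aggregation merges only a modest share of elements.

```python
# (entities, attributes) per source model, as listed on the slide.
sources = {"GDC": (26, 561), "PDC": (27, 500), "ICDC": (27, 265)}
adm_entities, adm_attributes = 55, 984

total_entities = sum(entities for entities, _ in sources.values())        # 80
total_attributes = sum(attributes for _, attributes in sources.values())  # 1326

# Fraction of the summed source elements that remain distinct in the ADM.
entity_ratio = adm_entities / total_entities
attribute_ratio = adm_attributes / total_attributes

print(f"entities: {adm_entities}/{total_entities} = {entity_ratio:.2%}")
print(f"attributes: {adm_attributes}/{total_attributes} = {attribute_ratio:.2%}")
```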
16. The Aggregated Data Model (ADM)
A substrate for refactoring into more deeply harmonized models
Node models are not well aligned at the outset
● e.g. ICDC and GDC: ~30% entity equivalence, <5% attribute equivalence
Property aggregation in the ADM is based on superficial analysis and strict aggregation criteria - so harmonization is minimal
● Only strictly equivalent elements within strictly equivalent entities are merged
Deeper aggregation and harmonization of elements will be achieved as the ADM is refactored into the CDM
● Terminological - e.g. GDC 'Treatment' vs ICDC 'Agent Administration'
● Structural - e.g. ICDC provides a more normalized model for clinical metadata
● Semantic - e.g. harmonizing disease terminologies used across systems / species
● Precision - e.g. variable detail provided about tumor staging across models
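The strict aggregation rule described on this slide can be sketched in Python as follows. This is a simplification under stated assumptions: "strictly equivalent" is approximated here by exact name matching, which is not necessarily the criterion CCDH applied, and the attribute names in the toy models are hypothetical. Only the entity names 'Treatment' and 'Agent Administration' come from the slide.

```python
from collections import defaultdict


def aggregate(models):
    """Merge attributes only when entity name AND attribute name match exactly.

    models: {node_name: {entity_name: set_of_attribute_names}}
    returns: {entity_name: {attribute_name: [contributing nodes]}}
    """
    adm = defaultdict(lambda: defaultdict(list))
    for node, entities in models.items():
        for entity, attributes in entities.items():
            for attribute in sorted(attributes):
                adm[entity][attribute].append(node)
    return {entity: dict(attrs) for entity, attrs in adm.items()}


# Toy source models; attribute names are hypothetical.
models = {
    "GDC": {
        "Sample": {"sample_type", "freezing_method"},
        "Treatment": {"agent"},
    },
    "ICDC": {
        "Sample": {"sample_type"},
        "Agent Administration": {"agent"},
    },
}

adm = aggregate(models)
```

In this toy case `sample_type` is merged across both nodes, while 'Treatment' and 'Agent Administration' remain separate entities even though they overlap in meaning. That residual difference is the kind of terminological gap the later refactoring into the CDM is meant to resolve.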
17. The Conceptual Domain Model (CDM) Prototype
High-Level Structural Changes Resulting from ADM Refactoring (Biospecimen Subdomain)
1. Specialization: Specialized specimen subtypes in the ADM get collapsed
2. Normalization: Data elements get distributed across a larger set of entities
3. Harmonization: Refactoring reduces the total number of properties by about half
ADM: 144 specimen properties in total
CDM (after refactoring): 74 specimen properties in total
Refactoring results in a much more normalized and deeply harmonized CDM
19. The Conceptual Domain Model (CDM)
CDM Data Dictionary (link)
● The CDM prototype is specified as a spreadsheet-based data dictionary
● Entities and their Attributes are each described in a separate sheet
● Cardinality of attributes is specified to be as permissive as possible initially
● Data Types are minimally specified
○ Simple: declared only at a high level (limited to literal, boolean)
○ Complex: proposals for Identifier, Coding, DateTime, Quantity, . . .
● A ‘Referenced Entities’ sheet lists entities that are referenced in CDM relationships, but are not in scope to model in this phase of work.
○ e.g. Organization, Visit, ConditionDiagnosis
● A ‘Data Containers’ sheet holds placeholders for objects that will be defined to group sets of related properties (specific structures for these TBD)
● Mappings of several types are also provided in the main Entity sheets:
○ ADM attributes that map to each CDM attribute (column L)
○ Source node attributes aggregated by these ADM attributes (column M)
○ CDM to FHIR mappings (column N)
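One way to picture a row of an Entity sheet in the data dictionary is the Python sketch below. It is an assumed shape, not the actual spreadsheet schema: the class name, field names, the "0..*" default for permissive cardinality, and all example values (including the FHIR path) are hypothetical. Only the three mapping columns (L, M, N) and the simple data types come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class CdmAttribute:
    """Hypothetical in-memory form of one attribute row in an Entity sheet."""
    entity: str
    name: str
    data_type: str = "literal"      # simple types are limited to literal, boolean
    cardinality: str = "0..*"       # assumed notation for "as permissive as possible"
    adm_attributes: list = field(default_factory=list)     # column L
    source_attributes: list = field(default_factory=list)  # column M
    fhir_mappings: list = field(default_factory=list)      # column N


def without_fhir_mapping(rows):
    """Return the names of attributes that have no CDM-to-FHIR mapping yet."""
    return [row.name for row in rows if not row.fhir_mappings]


# Example rows with made-up values, for illustration only.
rows = [
    CdmAttribute(
        entity="Specimen",
        name="specimen_type",
        adm_attributes=["Sample.sample_type"],
        source_attributes=["GDC.sample.sample_type"],
        fhir_mappings=["Specimen.type"],
    ),
    CdmAttribute(entity="Specimen", name="example_unmapped_attribute"),
]

unmapped = without_fhir_mapping(rows)
```

Holding the mappings as lists on each attribute mirrors the spreadsheet layout, where one CDM attribute can aggregate several ADM and source attributes.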
20. The Conceptual Domain Model (CDM)
Excerpt from the ‘Specimen’ sheet of the CDM Data Dictionary (link)
Attribute Definitions Mappings
21. Mapping CCDH Models to BRIDG and FHIR
BRIDG (bridgmodel.nci.nih.gov)
● A detailed and highly-normalized conceptual model covering the domains of clinical and translational research (a mapping ‘hub’, not an implementation model)
● ADM mappings to BRIDG support a deeper understanding of source model elements, keep our data model grounded in reality, and enable cross-mapping to other BRIDG-mapped models
FHIR (hl7.org/fhir)
● A data exchange model and API framework covering patient-level healthcare information generated in EHRs
● ADM mappings to FHIR provide a pragmatic target to guide ADM->CDM refactoring, as alignment can enable interoperability with clinical data systems, and potentially lets us leverage FHIR infrastructure and tools
Mappings from Sources and the CDM to BRIDG and FHIR can be derived from ADM mappings to each of these models
BRIDG mapping path for ADM.Sample.freezing_method:
BiologicSpecimen <--beAFunctionPerformedBy-- Subject
<--beParticipatedInBy-- PerformedMaterialProcessStep.methodCode
WHERE PerformedMaterialProcessStep --instantiate--> DefinedMaterialProcessStep.nameCode="freeze"
FHIR elements required to represent ADM Sample
22. Phase II Activities: Multiple Workstreams in Parallel
● Test / validate the CDM prototype against node data, competency questions, and feedback from stakeholders.
● Incorporate additional CRDC source models into the ADM (e.g. IDC) (Steps 1 and 2)
● Refactor additional ADM subdomains into the CDM (e.g. clinical metadata) (Steps 3 and 4)
● Evolve mature CDM content into an implementable logical model (Step 5)
● Terminological / value set harmonization
23. Key CCDH Modeling Work Products
May 2020 Deliverable Package
ID | Name | Description | Archived Document
WP0 | May 2020 Phase 1 Report | Short document describing work performed and products generated in this phase of work. | gdoc
WP1 | BRIDG and FHIR Mappings | A spreadsheet with detailed and provenanced mappings of ADM elements to BRIDG and FHIR | xls
WP2 | BRIDG and FHIR Covering Model Diagrams | UML-like views of elements in the BRIDG and FHIR models required to represent ADM entities. | pdf
WP3 | CDM Entity and Attribute Diagram | A class diagram providing a high-level view of the CDM | pdf
WP5 | CDM Dictionary (and Mappings) | A data dictionary spreadsheet detailing the Conceptual Domain Model, and its attribute-level mappings to the ADM and FHIR. | gsheets
WP6 | ADM Representation in FHIR | A representation of ADM entities using FHIR metamodeling language and tooling | gdoc
February 2020 Deliverable Package
ID | Name | Description | Archived Document
WP1 | CRDC Node concept maps | A side-by-side view of the core models implemented by GDC, PDC, and ICDC nodes. | png
WP3 | CRDC Data Model Dictionaries | One document with separate spreadsheets for GDC, PDC, and ICDC models. | gsheets
WP5 | Aggregated Model Concept Map | A high level view of the entities and relationships in the aggregated model. | png
WP6 | Aggregated Data Dictionary | Spreadsheets describing all elements from the Aggregated Model, and mappings to source elements. | gsheets
25. Delivering terminological & data model content to support data ingest / data harmonization within each node
● Provide tools to facilitate use of the harmonized data model and terminology by nodes
○ Harmonized data and terminologies enable access to data via CDA
● Metadata validation leveraging the harmonized terminology
● Mapping incoming datasets to the harmonized model
● Migration across harmonized model versions
● Leverage existing tools, existing terminologies, where possible
Behind every data model are the tools and terminologies that make it work
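Metadata validation against a harmonized terminology, as listed on this slide, could look roughly like the Python sketch below. The function, the field name, and the value set contents are hypothetical. NCIT:C154510 is reused from an earlier slide and the "EX:" code is a made-up placeholder, not a real terminology entry.

```python
def validate(record, value_sets):
    """Check each constrained field of a record against its allowed codes.

    record: {field_name: code}
    value_sets: {field_name: set_of_allowed_codes}
    returns: list of human-readable error strings (empty if the record is valid)
    """
    errors = []
    for field_name, allowed in value_sets.items():
        if field_name not in record:
            errors.append(f"{field_name}: missing")
        elif record[field_name] not in allowed:
            errors.append(f"{field_name}: {record[field_name]!r} not in value set")
    return errors


# Hypothetical value set for a hypothetical field.
value_sets = {"tobacco_use_code": {"NCIT:C154510", "EX:0000001"}}

ok = validate({"tobacco_use_code": "NCIT:C154510"}, value_sets)
bad = validate({"tobacco_use_code": "heavy"}, value_sets)
```

Returning a list of errors rather than raising on the first failure lets a node report every problem in an incoming dataset in a single pass.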
26. Terminology tools and services landscape assessment
What already exists? What can be best utilized or adapted for the CRDC? What are the gaps?
Admin/Access: Licensing; Registration; Authentication
Publication: Version management; Change management; Automated updates
UI/Browse/Search: Term search/Autocomplete; UI for navigation; Querying, filtering; Synonym support; Visualization; Community use indicated/tracked
API: Standard; Named entity recognition; Validation; Transitive closure
Identifiers: URIs; Dereferencing
Mapping: Serves maps; Map curation and authoring; Map validation; Value set services
Formats: Semantic typing; Inputs, outputs, OWL2, etc.
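Of the capabilities assessed here, transitive closure is the most algorithmic: given direct parent links between terms, a service returns every ancestor of a term. A minimal Python sketch follows. The term identifiers and the hierarchy are hypothetical and do not come from NCIt or any other real terminology.

```python
def ancestors(term, parents):
    """Return all direct and indirect parents of a term.

    parents: {term: iterable of direct parent terms}
    Uses an explicit stack and a visited set, so cycles and
    multiple inheritance do not cause repeated work.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical is-a hierarchy.
parents = {
    "EX:heavy_smoker": ["EX:smoker"],
    "EX:smoker": ["EX:tobacco_use_status"],
    "EX:tobacco_use_status": [],
}

result = ancestors("EX:heavy_smoker", parents)
```

With the closure available, a query for "EX:smoker" can also match records annotated with the more specific "EX:heavy_smoker".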
27. Data annotation and QC tools
What already exists? What can be best utilized or adapted for the CRDC? What are the gaps?
Mapping and Transformation: standardization; NLP/named entity recognition; semantic similarity
Metadata Validation and QC: value sets; logical constraints; syntax
Data Annotation: template building; term search; terminology browsing
Examples: CEDAR; Ptolemy.V; Metadata Validation Service; Simple Terminology Server; FHIR Terminology Server; OpenRefine; RDF shapes (ShEx/SHACL)
28. ISO 11179 Metadata Registries (MDR)
Standard for recording information about data models
● Provenance / history
● Contacts / managing organization
● Semantics - what the elements in a data model represent
ISO 11179-3 - registry metamodel and basic attributes carry a standard model of “binding”: how one associates ontology meaning with both the data element itself and its content.
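The two sides of “binding” mentioned on this slide (meaning of the data element itself, and meaning of its content) can be sketched with two small Python classes. This is a deliberately reduced illustration and not the ISO 11179-3 metamodel: the class and field names are assumptions, and every code except NCIT:C154510 is a made-up placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class PermissibleValue:
    value: str         # what actually appears in the data
    meaning_code: str  # ontology concept that the value stands for


@dataclass
class DataElement:
    name: str
    concept_code: str  # ontology concept bound to the element itself
    permissible_values: list = field(default_factory=list)

    def meaning_of(self, value):
        """Return the concept bound to a data value, or None if unbound."""
        for permissible in self.permissible_values:
            if permissible.value == value:
                return permissible.meaning_code
        return None


# Hypothetical element; "EX:" codes are placeholders.
smoking = DataElement(
    name="smoking_status",
    concept_code="EX:tobacco_use_status",
    permissible_values=[
        PermissibleValue("heavy", "NCIT:C154510"),
        PermissibleValue("never", "EX:never_smoker"),
    ],
)

code = smoking.meaning_of("heavy")
```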
32. FHIR as a Modeling Framework
ADM Models Represented using FHIR Metamodel, and generated documentation
https://fhir.hotecosystem.org/ccdh/fhir/, https://fhir.hotecosystem.org/ccdh/fhir/aliquot.html
36. Putting it all together -- do we need a unifying representation?
Model in Google Sheets
37. Acknowledgments
Center for Biomedical Informatics & Information Technology
● Allen Dearry
● Sherri de Coronado
● Erika Kim
● Denise Warzel
● Melissa Cook
Samvit Solutions
● Smita Hastak
● Wendy Ver Hoef
● Charles Yaghmour
Frederick National Laboratory for Cancer Research
● Todd Pihl
● Resham Kulkarni
DCF: Data Commons Framework - Infrastructure
HTAN: Human Tumor Atlas Network
ICDC: Integrated Canine Data Commons
IDC: Imaging Data Commons
GDC: Genomic Data Commons
PDC: Proteomics Data Commons
SevenBridges
Gabriella Miller Kids First Data Resource Center
CDS: Cancer Data Services
Broad Institute FireCloud
Institute for Systems Biology
NBIA: National Biomedical Imaging Archive
SEER Virtual Tissue Repository
CIDC: Cancer Immunology Data Commons
Cancer Data Aggregator
● Brian O’Connor
● Alex Baumann
● David Pot
● Jack DiGiovanna
● Cara Mason