FAIR as a Basis for the Cancer Research Data Commons
Ian Fore, D.Phil.
NCI Center for Biomedical Informatics and Information Technology
@ianmfore
Demystifying FAIR Science: Examples, Tools and Use Cases
ISMB/ECCB 2019
July 23, 2019
Three themes
• Experience of FAIR
• Capturing diversity of science
  • Allow more people to be compliant
• Understanding collaboration
Gartner hype cycle
Under the hype curve
“there are emerging indications that the original meanings of findable,
accessible, interoperable, and reusable sometimes may be stretched”
“the proposed implementation of these principles, with the goal of an
Internet of FAIR Data and Services, is beginning to raise concern and
confusion”
Cloudy, increasingly FAIR; revisiting the FAIR Data guiding
principles for the European Open Science Cloud
Mons et al (2017)
Information Services & Use, vol. 37, pp. 49-56
FAIR Evaluation
Desirable characteristics of FAIR Assessment
• Transparency
  • What are the criteria
  • Who is doing the assessment
  • Qualifications for doing so
• Not just a score
• Non-judgmental
  • Reflect qualities of a given resource
• Not strict
  • Important to leave scope for the novelty of science
CC BY-SA 4.0
Case statement of the WG
2019-04-03 www.rd-alliance.org - @resdatall
Challenge
• Ambiguity and wide range of interpretations of FAIRness
• Lack of a common set of core assessment criteria and a minimum set of shared guidelines
Approach
• Bring together stakeholders
• Build on existing approaches and expertise
Intended results
• RDA Recommendation of core assessment criteria
• Generic and expandable self-assessment model
• Self-assessment toolset
• FAIR data checklist
Measuring maturity
Alternative #1
Five options for R1.1 [metadata/data]
Level 0: no licence
Level 1: non standard licence in a human-readable format allowing access
Level 2: standard licence in a human-readable format allowing access
Level 3: standard open licence in a human-readable format allowing reuse
Level 4: standard open licence in a machine-readable format allowing reuse
Level 5: standard open licence in a machine-readable format with clear
criteria allowing reuse
Each option is defining a maturity level
[Figure: FAIRness compliance for R1.1, Level 0 to Level 5. Method step 9.]
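The maturity ladder for R1.1 can be read as a sequence of cumulative checks on a licence. The following is a minimal sketch of that reading, not the RDA working group's method: the `Licence` class, its field names, and the assumption that each level strictly builds on the previous one are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Licence:
    """Hypothetical description of a resource's licence (field names are assumptions)."""
    present: bool = False
    standard: bool = False          # a recognised licence rather than bespoke terms
    open_licence: bool = False      # permits reuse, not only access
    machine_readable: bool = False
    clear_criteria: bool = False    # reuse criteria are stated explicitly


def r1_1_maturity(lic: Licence) -> int:
    """Return the R1.1 maturity level (0 to 5) following the levels listed above.

    Assumes the levels are cumulative: a level is reached only when every
    lower level's condition also holds.
    """
    if not lic.present:
        return 0
    if not lic.standard:
        return 1
    if not lic.open_licence:
        return 2
    if not lic.machine_readable:
        return 3
    if not lic.clear_criteria:
        return 4
    return 5


# Example: a standard open licence published only as human-readable text.
level = r1_1_maturity(Licence(present=True, standard=True, open_licence=True))
```

Returning a level rather than a pass/fail flag keeps the assessment descriptive, in line with the "not just a score" and "non-judgmental" characteristics listed earlier.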
NCI Cancer Research Data Commons
[Diagram. Labels recovered from the slide:]
• Data types: Clinical, Proteomics, Imaging, Genomics, Immuno-oncology, Cancer Models, Biomarkers
• Cloud Resources: SBG CGC, Broad FireCloud, ISB CGC
• Elastic Compute, Query, Visualization, Tool Deployment, Analysis
• Clinical Proteomics Tumor Analysis Consortium*; The Cancer Imaging Archive* (TCIA); As Is Genomics
• Web Interface, APIs, Data Submission
• Data Commons Framework: Authentication & Authorization, Data Models & Dictionaries, Computational Workspaces, Tool Repositories, Metadata Validation & Tools
• Data Contributors and Consumers
GDC and Cloud Resources are available now; Framework, As Is Genomics, PDC, IDC are in development; all else is notional.
NCBI Sequence Data Delivery Pilot (SDDP)
• Sequence data will be stored in commercial cloud
• Data submission by existing dbGaP mechanisms
  • Familiar to investigators
  • Minimal change to existing dbGaP sequence submission
• VCF and phenotype observations in dbGaP
• Relieves pressure on SRA by allowing NCI to provide directly for storage of data from its programs
Data “As Is?”
• Not as harmonized and restructured as e.g. Genomic Data Commons
• Goal:
• Usable by those not engaged in the creation or production of the dataset
• FAIR as a general principle
• Ensure the context of samples is captured
Standards essential but…
• Science inherently non-standard
• Even when standard – large diversity of biomedical data
• Need to understand data standards from other domains
• Domains constantly in flux
• Data models reflect unique study designs
• Harmonization has proven expensive
Shift what we ask for
Not just
“These are the columns you must provide”
“Fit your data into our model”
But also (not strict)
What are the variables that define your study?
What is the model of your data? Standardized or not.
Standardizing the description of the non-standard
One dbGaP study - GECCO
<variable id="phv00357184.v1">
  <name>SEX</name>
  <description>Sex of participant</description>
  <type>string</type>
  <value>Female</value>
  <value>Male</value>
</variable>
Another dbGaP study - GTEx
<variable id="phv00169062.v7">
  <name>SEX</name>
  <description>Sex</description>
  <type>integer, encoded value</type>
  <comment>Sex The Donor's Identification of
  sex based upon self-report, family/next of kin,
  or medical record abstraction. </comment>
  <value code="1">Male</value>
  <value code="2">Female</value>
</variable>
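Because both studies describe SEX in the same XML shape, the descriptions themselves are enough to bring the two encodings onto a common footing. The sketch below is one way to do that, assuming abbreviated copies of the two variable descriptions (the comment element is omitted) and hypothetical helper names `value_decoder` and `decode`; it is not a dbGaP tool.

```python
import xml.etree.ElementTree as ET

# Abbreviated copies of the two variable descriptions shown above.
GECCO_SEX = """<variable id="phv00357184.v1">
  <name>SEX</name>
  <type>string</type>
  <value>Female</value>
  <value>Male</value>
</variable>"""

GTEX_SEX = """<variable id="phv00169062.v7">
  <name>SEX</name>
  <type>integer, encoded value</type>
  <value code="1">Male</value>
  <value code="2">Female</value>
</variable>"""


def value_decoder(variable_xml: str) -> dict:
    """Build a raw-value -> label mapping from a variable description.

    Coded values map from their code attribute; uncoded values map to themselves.
    """
    root = ET.fromstring(variable_xml)
    decoder = {}
    for value in root.findall("value"):
        label = value.text.strip()
        decoder[value.get("code", label)] = label
    return decoder


def decode(variable_xml: str, raw_values) -> list:
    """Translate raw observations into labels using the variable's own description."""
    decoder = value_decoder(variable_xml)
    return [decoder[str(raw)] for raw in raw_values]


gecco_labels = decode(GECCO_SEX, ["Female", "Male"])
gtex_labels = decode(GTEX_SEX, [2, 1])
```

No study had to change how it stores its data; the standardized description carries the information needed to compare them.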
What is needed for standard description of variables?
<variable id="phv00217659.v3">
  <name>SMUBRTRM</name>
  <description>Uberon Term, anatomical location as described by the Uber Anatomy
  Ontology (UBERON)</description>
  <type>string</type>
  <comment>Generated at LDACC Term as specified by the Uber Anatomy
  Ontology (UBERON),
  http://bioportal.bioontology.org/ontologies/UBERON. </comment>
</variable>
<variable id="phv00169242.v7">
  <name>SMTSISCH</name>
  <description>Total Ischemic time for a sample</description>
  <type>integer</type>
  <unit>Minutes</unit>
  <comment>Sample Ischemic Time Interval between actual death, presumed death,
  or cross clamp application and final tissue stabilization. </comment>
</variable>
From GTEx Sample Attributes
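One way to see what a standard description would need is to check which descriptive fields each variable already carries. The sketch below is illustrative only: the `summarise` helper and its output keys are assumptions, and the XML is an abbreviated copy of the two GTEx variables above. It shows that the unit is a first-class element while the ontology is only reachable as a URL buried in free-text comment.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copies of the two GTEx variable descriptions shown above.
SMUBRTRM = """<variable id="phv00217659.v3">
  <name>SMUBRTRM</name>
  <type>string</type>
  <comment>Generated at LDACC Term as specified by the Uber Anatomy
  Ontology (UBERON),
  http://bioportal.bioontology.org/ontologies/UBERON. </comment>
</variable>"""

SMTSISCH = """<variable id="phv00169242.v7">
  <name>SMTSISCH</name>
  <type>integer</type>
  <unit>Minutes</unit>
</variable>"""


def summarise(variable_xml: str) -> dict:
    """Report which descriptive fields a variable description provides."""
    root = ET.fromstring(variable_xml)
    comment = root.findtext("comment", default="")
    # URLs in the comment are the only machine-findable hint of an ontology.
    urls = [u.rstrip(".") for u in re.findall(r"https?://[^\s,]+", comment)]
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "type": root.findtext("type"),
        "unit": root.findtext("unit"),  # None when no <unit> element exists
        "urls_in_comment": urls,
    }


ontology_variable = summarise(SMUBRTRM)
numeric_variable = summarise(SMTSISCH)
```

A dedicated element or attribute for the ontology, analogous to `<unit>`, would remove the need to scrape the comment.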
What is the standard to describe data?
Things, Attributes, Relationships
• Some existing candidates
• Metadata repositories
• ISO/IEC 11179 style (caDSR, CDISC, etc.)
• RDA Data Type Registries
• GA4GH – Discovery Work Stream – Search/SchemaBlocks
• BDBag
• PFB – Portable Format for Biomedical Data
• JSON-LD
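As an illustration of the last candidate, a variable description could be expressed in JSON-LD. The slide names only JSON-LD itself; the choice of the schema.org vocabulary (`Dataset`, `variableMeasured`, `PropertyValue`, `unitText`) and the helper name `variable_as_jsonld` are assumptions made for this sketch, which reuses the GTEx ischemic-time variable shown earlier.

```python
import json


def variable_as_jsonld(dataset_name, name, description, unit_text=None):
    """Describe one study variable as a schema.org Dataset with variableMeasured.

    The vocabulary is an assumption for illustration; other vocabularies
    could serve the same purpose.
    """
    variable = {
        "@type": "PropertyValue",
        "name": name,
        "description": description,
    }
    if unit_text is not None:
        variable["unitText"] = unit_text
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": dataset_name,
        "variableMeasured": [variable],
    }


doc = variable_as_jsonld(
    "GTEx Sample Attributes",
    "SMTSISCH",
    "Total Ischemic time for a sample",
    unit_text="Minutes",
)
print(json.dumps(doc, indent=2))
```

The same function accepts a non-standard variable just as readily as a standard one, which is the point of standardizing the description rather than the data.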
Data Sharing: It’s not the technology, it’s the attitude we take
Greg Simon, Director – Cancer Moonshot Task Force
NIH Institutes Represented in BD2K Radical Collaboration Training
• Center for Scientific Review
• National Center for Advancing Translational Sciences
• National Cancer Institute
• National Human Genome Research Institute
• National Heart, Lung, and Blood Institute
• National Institute on Aging
• National Institute of Allergy and Infectious Diseases
• National Institute of Biomedical Imaging and Bioengineering
• National Institute of Child Health and Human Development
• National Institute on Drug Abuse
• National Institute of General Medical Sciences
• National Institute of Mental Health
• National Institute on Minority Health and Health Disparities
• NIH Office of the Director
Thanks to: Phil Bourne and Warren Kibbe
we wrestle with the question of
whether natural selection inherently
favors selfish behavior. Is the
process of evolutionary competition
cruel, or does it sometimes pay to be
nice?
https://www.wnycstudios.org/story/104082-prisoners-dilemma
FIRO-B
All human beings share some basic similarities
All people want to feel: Significant, Competent, Likable
All people have some fear of being: Ignored, Humiliated, Rejected
All people have behavior preferences about: Inclusion, Control, Openness
By creating environments that invite people to feel significant, competent and likable, you
reduce the level of fear and create environments that are more conducive to honesty,
collaboration, accountability and fun! It is in these environments where people bring their best
to the workplace and productivity soars in an atmosphere of trust.
By being accountable for your own internal environment, you take responsibility for your own
psychological patterns and mental models and do not inappropriately associate your own
deeper self-concept with the substantive issues you are working with.
Celeste Blackman, Green Zone Culture Group
Operational phase matters more than the outset
• FAIR has served to bring the community together
• Success depends on the follow-through
• There is no silver bullet (Fred Brooks)
Submit Proposal to Build the Cancer Data Aggregator
• Accessible via NCI Cloud Resources and other applications.
• API layer will allow users to query across:
  • NCI Cancer Research Data Commons (CRDC)
  • NCI Data Coordinating Centers (e.g. HTAN DCC)
  • Additional Repositories (e.g. KidsFirst DRC)
Proposals due by August 15, 2019: https://go.usa.gov/xyKe3


Editor's Notes

  • #5 FAIR: Findable, Accessible, Interoperable, Reusable
  • #12 Building upon this foundation, our vision for a Cancer Research Data Commons is a virtual, expandable infrastructure that will support collaboration among researchers, clinicians, patients, computational scientists, and tool developers. It will house multiple cloud-based Commons Nodes for multiple data types, initially including Genomic, Imaging, and Proteomics data. In the future, we envision additional nodes that support other data types.
    ANIMATE: The genomics data node is built upon the foundation of the GDC, which provides a means of data submission, user interfaces, and visualization tools.
    ANIMATE: “As is genomics” is similar, in principle, to the Sequence Read Archive within dbGaP.
    ANIMATE: The Cloud Resources provide search, compute, and analytical resources, as well as a way for researchers to use their own data and tools.
    ANIMATE: The Data Commons Framework provides secure access, an approach to metadata validation, user workspaces, shared data models and dictionaries to facilitate interoperability, and a tool repository to allow users to use and share new algorithms, tools, pipelines, and visualizations.
    ANIMATE: The Proteomics Data node will incorporate data from the Clinical Proteomics Tumor Analysis Consortium (CPTAC) and other sources.
    ANIMATE: The Imaging Data node will work with the Cancer Imaging Archive (TCIA) and other sources of imaging data.
    In terms of status, GDC and Cloud Resources are operational; Framework, As Is Genomics, PDC, and IDC are in development; and more is planned for the future.
  • #16 There should be a different obligation: to require the investigator to share what they know rather than a fixed set of attributes. This should be more motivating to the investigator as it unlocks the creativity inherent in how they designed their study. “Metadata” is seen as a burden. Can we lessen the burden by making sure we capture what scientists are doing anyway?
  • #18 Two examples here: 1. A variable that uses an ontology. How could the term be more explicitly referenced? 2. A numeric variable where the type (integer) and unit are specified.