An Overview of Cancer
Data Repositories
May 24, 2023 Warren A. Kibbe, Ph.D.
Vice Chair & Professor of
Biostatistics and Bioinformatics
Duke University School of Medicine
warren.kibbe@duke.edu
We will take a tour of:
AACR Project GENIE
NCI Genomic Data Commons
The Gabriella Miller Kids First Data Portal
dbGaP
The Human BioMolecular Atlas Program (HuBMAP)
Human Tumor Atlas Network (HTAN)
CIViC DB
MyCancerGenome
COSMIC
Childhood Cancer Data Catalog
By no means are these the only or even the most important ones!
Our ability to generate biomedical data
continues to grow in terms of variety and
volume
Current sources of data
molecular genome pathology imaging labs notes sensors
icons by the Noun Project
Things to think about
Many different metaphors for building a data repository
Many different methods for making data FAIR (Findable, Accessible, Interoperable, Reusable)
Data wrangling – organizing, formatting, harmonizing, semantically annotating – is hard work, and different resources take different approaches
Usability is dependent on the use case, the tools, the problem domain
Design Considerations
Where do you want flexibility? File formats? Data types? Definitions?
How much can you anticipate at the beginning?
Where is it likely that your requirements will change?
Who are your users? Biologists? Chemists? Clinicians? Quants?
Will they come to explore (hypothesis generation)?
Will they bring an existing question (hypothesis answering)?
Do they want to use their own analysis tools?
Do they want to build new tools to explore the data?
What is the right data resource to use?
What is the question you want to answer?
Are you exploring how to answer the problem or ready to analyze?
Do you need to explore a dataset to understand if the data is available?
Does the resource support the kind of exploration you want to do?
Do you have an analysis plan and workflow defined?
Are your analysis tools appropriate for the resource you want to use?
What is the right data resource to use?
Does it have the right kind of data? (clinical trial data, scRNA-seq, specific cell line,
disease area, model system)
What are the access policies? (unrestricted, controlled access)
Level of security required (none, limited, regulated [think HIPAA], very sensitive)
Ability to peruse the metadata?
Can you download the data?
If it is an enclave/closed ecosystem, what tools are supported?
Can you bring your own tools?
A useful catalog of data resources
https://datacatalog.ccdi.cancer.gov
Childhood Cancer Data Catalog: TCIA projects
A short scenario
You are interested in exploring non-small cell lung cancer and
identifying the top 5 mutations. You want to plot either
progression free survival or a survival curve for each mutation.
Possible data sources: AACR Project GENIE and the NCI
Genomic Data Commons (part of the NCI Cancer Research
Data Commons)
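A minimal sketch of the first step in this scenario, ranking the most frequently mutated genes. It assumes "top 5 mutations" is read at the gene level and that each patient is counted once per gene. The records and the function name `top_mutated_genes` are hypothetical toy values, not data from GENIE or the GDC.

```python
from collections import Counter

# Hypothetical toy records: (patient_id, gene). Real input would be a
# mutation table exported from the chosen repository.
mutations = [
    ("P1", "TP53"), ("P1", "TP53"), ("P1", "KRAS"),
    ("P2", "TP53"), ("P2", "EGFR"),
    ("P3", "KRAS"), ("P3", "STK11"),
    ("P4", "TP53"), ("P4", "KEAP1"),
    ("P5", "EGFR"), ("P5", "TP53"),
    ("P6", "KRAS"), ("P6", "LRP1B"),
]


def top_mutated_genes(records, n=5):
    """Rank genes by the number of distinct patients carrying a mutation."""
    patients_by_gene = {}
    for patient_id, gene in records:
        # A set ensures a patient with two hits in one gene is counted once.
        patients_by_gene.setdefault(gene, set()).add(patient_id)
    counts = Counter({g: len(p) for g, p in patients_by_gene.items()})
    return counts.most_common(n)


top5 = top_mutated_genes(mutations)
print(top5)
```

Counting patients rather than raw mutation rows avoids inflating genes that are long or heavily mutated within a single tumor.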
AACR Project GENIE cBioPortal resource
https://genie.cbioportal.org
Start with Cohort 13.1 – data from 167,358 samples
Select the 24,110 NSCLC
NSCLC view
Lung adenocarcinoma
18,430 cases
‘Manhattan’ plot of mutations in TP53
Not a survival
analysis!
Looks like patients
with somatic TP53
mutations die
younger
Many potential
confounders!
cBioPortal makes plotting survival data easy – but not with
GENIE data! This plot is men vs women with bladder cancer
https://cbioportal.org
Now we will explore the NCI Genomic Data Commons
https://portal.gdc.cancer.gov
Genomic Data Commons
Lung cancer - 12,262 cases
Only 1,475 have survival data
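The same kind of case count can in principle be requested programmatically. The sketch below only builds a request URL for the GDC API `cases` endpoint and does not send it. The `primary_site` value "Bronchus and lung" and the helper name `build_cases_query` are assumptions for illustration; check the facet values shown in the portal before relying on them.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for case-level queries.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_cases_query(primary_site, size=0):
    """Build a GDC cases query URL filtered by primary site.

    size=0 asks for no records, so only the pagination summary
    (including the total count) would come back.
    """
    filters = {
        "op": "in",
        "content": {"field": "primary_site", "value": [primary_site]},
    }
    params = {"filters": json.dumps(filters), "size": size}
    return f"{GDC_CASES_ENDPOINT}?{urlencode(params)}"


# Assumed primary_site label; verify it against the portal facets.
url = build_cases_query("Bronchus and lung")
print(url)
```

Counts returned by the API may differ from the numbers on the slides, since the GDC is updated with each data release.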
Genomic Data Commons
Genes view
Notice the case numbers for each gene:
TP53
TTN
MUC16
CSMD3
RYR2
USH2A
LRP1B
ZFHX4
XIRP2
Survival analysis out of the box
• The fact that mutations in all of those genes appear to convey a survival advantage is puzzling and contradicts the simple longevity curve we got from cBioPortal in GENIE
• Strange drop in survival at about 5 years
• Need a lot more analysis before believing the plot!
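To see what an out-of-the-box survival plot is computing, here is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. The follow-up times and event flags are hypothetical toy values, and the function name `kaplan_meier` is illustrative. It is not the code used by cBioPortal or the GDC.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each observed event time.

    durations: follow-up time per patient.
    events: 1 if death was observed at that time, 0 if censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical toy cohort: months of follow-up and event flags.
durations = [6, 12, 12, 18, 24, 30]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, events)
for time, prob in curve:
    print(time, round(prob, 3))
```

When few patients remain at risk, a single death moves the curve a long way, which is one reason late drops in a plot deserve scrutiny.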
Revisit: AACR Project GENIE cBioPortal resource
Not a survival
analysis!
Looks like patients
with somatic TP53
mutations die
younger
Many potential
confounders!
Survival analysis out of the box
• The fact that mutations in all of those genes appear to convey a survival advantage is puzzling and contradicts the simple longevity curve we got from cBioPortal in GENIE
• There are multiple potential explanations:
  • TP53 mutations arise in younger patients, and their survival with disease is longer (maybe?)
  • There are one or more confounders in the TCGA data, such as a skew toward late-stage disease for patients with TP53 mutations (and the other mutations as well!) (definitely!)
  • Patients are from non-comparable cohorts (probably)
  • Other confounders you can think of?
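A small worked example of how a confounder can reverse an apparent effect. The counts below are entirely hypothetical and chosen only to illustrate the mechanism; they are not derived from TCGA or GENIE. Stage is used as the confounder, and the group labels are illustrative.

```python
# Hypothetical counts: (stage, mutation status) -> (patients, alive at 5 years)
cohort = {
    ("early", "mutated"): (90, 72),
    ("early", "wild_type"): (10, 9),
    ("late", "mutated"): (10, 2),
    ("late", "wild_type"): (90, 27),
}


def survival_rate(groups):
    """Fraction alive across a list of (patients, alive) pairs."""
    patients = sum(n for n, _ in groups)
    alive = sum(a for _, a in groups)
    return alive / patients


def crude_rate(status):
    """Survival rate ignoring stage."""
    return survival_rate([v for (_, s), v in cohort.items() if s == status])


def stratified_rate(stage, status):
    """Survival rate within a single stage."""
    return survival_rate([cohort[(stage, status)]])


for status in ("mutated", "wild_type"):
    print("crude", status, crude_rate(status))
for stage in ("early", "late"):
    for status in ("mutated", "wild_type"):
        print(stage, status, stratified_rate(stage, status))
```

In this toy cohort the mutated group looks better overall only because most of its patients are early stage. Within each stage the mutated group does worse, so the crude comparison points the wrong way.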
Survival analysis out of the box
• These problems, and the need for consistent data entry across all cases, are why clinical trials are the standard for interpreting progression-free survival and performing survival analysis
• This is a taste of the issues in interpreting real-world data, such as knowing whether the cases and controls are balanced for known confounders
• Knowing which questions are appropriate for a given dataset is critical
Confounders
A story about lung cancer and exercise
Sometimes we overlook the obvious
Don’t forget to include known confounders!
Background on TCGA and GENIE
• The Cancer Genome Atlas (TCGA) is a long-running study of all comers across many institutions; however, specific institutions contributed specimens and data predominantly from one disease site
• TCGA is biased toward larger tumors and stage 3
• TCGA was research sequencing, so not performed with therapeutic intent
• TCGA is primarily research whole exome sequencing
• GENIE is also an ‘all comers’ study. Sequence comes from a variety of in-house and commercial clinical sequencing platforms, primarily targeted sequencing, performed based on perceived clinical benefit for the patient
More on TCGA and GENIE
• Follow-up, lines of therapy, PFS, and recurrence are not consistently captured
• Diagnosis and stage at specimen collection are fairly consistent
Good questions for TCGA and GENIE
• Prevalence of mutations in a disease area
• Mutational frequency at time of diagnosis
• Mutational patterns at time of diagnosis
• Age at diagnosis
Beyond TCGA and GENIE
The next generation repository will need to carefully capture lines of therapy, multiple disease measures, tumor evolution, and more detailed biomarker measurements such as scRNA-seq, proteomics, metabolomics, and the tumor microenvironment.
For deep characterization of normal tissues, see the Human BioMolecular Atlas Program (HuBMAP): https://commonfund.nih.gov/hubmap
For cancer progression from precancerous lesions through to advanced disease, see the Human Tumor Atlas Network (HTAN): https://humantumoratlas.org
HuBMAP and HTAN are laying the foundation for the next stage of repositories.
HuBMAP https://portal.hubmapconsortium.org
HTAN https://humantumoratlas.org
Data Access
For the example analyses, we used the public (open
access) tier of data from Project GENIE and TCGA.
For detailed sequence and clinical data you generally
need to have restricted access
For NIH studies, restricted access data requires
submitting a dbGaP access request
dbGaP
dbGaP – Access to Kids First data
An auxiliary way to look at dbGaP projects
This view also does not show variable names, but it does list file types
dbGaP and access to metadata
• With the exception of the number of participants, details such as the types of experimental variables (targeted sequencing, whole exome, whole genome, RNA-seq, scRNA-seq, methyl-seq, etc.), clinical variables (diagnosis, stage, treatments, adverse events), and demographic variables (age, race, ethnicity, urban vs. rural, etc.) are not available for many projects until you have access to the full dataset.
• This makes it hard to identify which specific datasets you might be interested in
To understand Kids First, go to it First!
https://kidsfirstdrc.org
Some stats about Kids First, even before creating a login
Kids First login or sign up
You can use an ORCID or a Google account
You do have to accept the terms & conditions
You can see the studies
Drill down on the childhood cancer studies
Select Neuroblastoma – 1681 participants
Quick visual breakdown, including survival analysis
Looking at variants, more than 280 million of them!
Select gene, then enter TP53
Now we see the variants for TP53
Now select for pathogenic variants – 6 of them
Add likely pathogenic – now we have 8 variants
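The portal steps above amount to two successive filters, by gene and then by clinical significance. A minimal sketch on hypothetical records follows. The field names `gene` and `significance` and the helper `select_variants` are assumptions for illustration, not the Kids First portal schema or API.

```python
# Hypothetical variant records; real ones would carry many more fields.
variants = [
    {"id": "v1", "gene": "TP53", "significance": "Pathogenic"},
    {"id": "v2", "gene": "TP53", "significance": "Likely pathogenic"},
    {"id": "v3", "gene": "TP53", "significance": "Benign"},
    {"id": "v4", "gene": "EGFR", "significance": "Pathogenic"},
    {"id": "v5", "gene": "TP53", "significance": "Uncertain significance"},
    {"id": "v6", "gene": "TP53", "significance": "Pathogenic"},
]


def select_variants(records, gene, significances):
    """Keep variants in one gene whose significance is in the allowed set."""
    wanted = {s.lower() for s in significances}
    return [
        r for r in records
        if r["gene"] == gene and r["significance"].lower() in wanted
    ]


pathogenic = select_variants(variants, "TP53", ["Pathogenic"])
broader = select_variants(variants, "TP53", ["Pathogenic", "Likely pathogenic"])
print(len(pathogenic), len(broader))
```

Widening the significance set can only add variants, which matches the way the count grows in the portal when likely pathogenic is included.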
TP53 is not druggable: let's look at something clinically actionable
EGFR T790M is a common resistance mutation in patients with NSCLC who receive tyrosine kinase inhibitors (TKIs) such as gefitinib and erlotinib, which are small-molecule targeted therapies.
We are now going to start looking at EGFR.
ClinVar
Hard to interpret!
The COSMIC Database
https://cancer.sanger.ac.uk/cosmic
Very different purpose – it is to catalog somatic mutations
associated with cancer
COSMIC resistance mutations
https://cancer.sanger.ac.uk/cosmic/drug_resistance
COSMIC view of mutations in and around EGFR
COSMIC view of mutations around EGFR
COSMIC view of EGFR
https://cancer.sanger.ac.uk/cosmic/gene/analysis?ln=EGFR
https://www.mycancergenome.org
EGFR in MyCancerGenome
Most common cancers with EGFR Mutations in GENIE
Drugs in MyCancerGenome – selecting TKIs
gefitinib in MyCancerGenome
What about exclusion biomarkers like
EGFR T790M?
Going back to the NCI Genomic Data Commons
https://portal.gdc.cancer.gov
Look for EGFR
EGFR T790M mutations arise as resistance mutations to 1st- or 2nd-generation TKI therapy,
so they are not common in patients or cancers that have not received these therapies
CIViC DB
Similar to ClinVar,
ClinGen,
MyCancerGenome
CIViC DB – EGFR T790M
Mutation effect in the description is clear
Mutation effect on these drugs is not clear
CIViC DB – erlotinib
Mutation effect on erlotinib is not clear
In conclusion
Many different metaphors for building a data repository
Many different methods for making data FAIR (Findable, Accessible, Interoperable, Reusable)
Data wrangling – organizing, formatting, harmonizing, semantically annotating – is hard work, and different resources take different approaches
Usability is dependent on the use case, the tools, the problem domain
What is the right data resource to use?
What is the question you want to answer?
Are you exploring how to answer the problem or ready to analyze?
Do you need to explore a dataset to understand if the data is available?
Does the resource support the kind of exploration you want to do?
Do you have an analysis plan and workflow defined?
Are your analysis tools appropriate for the resource you want to use?
What is the right data resource to use?
Does it have the right kind of data? (clinical trial data, scRNA-seq, specific cell line, disease
area, model system)
What are the access policies? (unrestricted, controlled access)
Level of security required (none, limited, regulated [think HIPAA], very sensitive)
Ability to peruse the metadata?
Can you download the data?
If it is an enclave/closed ecosystem, what tools are supported?
Can you bring your own tools?
There are many resources out there to use
Just ask yourself a few pertinent questions as you use them!
Thank you!
Questions?
What is Real World Data?
Collected in the context of patient care. Real World Data was called out as part of the 21st Century Cures Act
21st Century Cures Act: https://www.fda.gov/regulatory-information/selected-amendments-fdc-act/21st-century-cures-act
Graphic from HealthCatalyst: https://www.healthcatalyst.com/insights/real-world-data-chief-driver-drug-development

Call Girl Chennai Indira 9907093804 Independent Call Girls Service Chennai
 
Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...
Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...
Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...
 
Russian Call Girls Chennai Madhuri 9907093804 Independent Call Girls Service ...
Russian Call Girls Chennai Madhuri 9907093804 Independent Call Girls Service ...Russian Call Girls Chennai Madhuri 9907093804 Independent Call Girls Service ...
Russian Call Girls Chennai Madhuri 9907093804 Independent Call Girls Service ...
 
Call Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on Delivery
Call Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on DeliveryCall Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on Delivery
Call Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on Delivery
 
Call Girls Chennai Megha 9907093804 Independent Call Girls Service Chennai
Call Girls Chennai Megha 9907093804 Independent Call Girls Service ChennaiCall Girls Chennai Megha 9907093804 Independent Call Girls Service Chennai
Call Girls Chennai Megha 9907093804 Independent Call Girls Service Chennai
 
Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...
Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...
Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...
 
Ahmedabad Call Girls CG Road 🔝9907093804 Short 1500 💋 Night 6000
Ahmedabad Call Girls CG Road 🔝9907093804  Short 1500  💋 Night 6000Ahmedabad Call Girls CG Road 🔝9907093804  Short 1500  💋 Night 6000
Ahmedabad Call Girls CG Road 🔝9907093804 Short 1500 💋 Night 6000
 
Housewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment Booking
Housewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment BookingHousewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment Booking
Housewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment Booking
 
Call Girls Whitefield Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Whitefield Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls Whitefield Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Whitefield Just Call 7001305949 Top Class Call Girl Service Available
 
Aspirin presentation slides by Dr. Rewas Ali
Aspirin presentation slides by Dr. Rewas AliAspirin presentation slides by Dr. Rewas Ali
Aspirin presentation slides by Dr. Rewas Ali
 
Bangalore Call Girls Marathahalli 📞 9907093804 High Profile Service 100% Safe
Bangalore Call Girls Marathahalli 📞 9907093804 High Profile Service 100% SafeBangalore Call Girls Marathahalli 📞 9907093804 High Profile Service 100% Safe
Bangalore Call Girls Marathahalli 📞 9907093804 High Profile Service 100% Safe
 
VIP Call Girls Pune Vani 9907093804 Short 1500 Night 6000 Best call girls Ser...
VIP Call Girls Pune Vani 9907093804 Short 1500 Night 6000 Best call girls Ser...VIP Call Girls Pune Vani 9907093804 Short 1500 Night 6000 Best call girls Ser...
VIP Call Girls Pune Vani 9907093804 Short 1500 Night 6000 Best call girls Ser...
 
Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...
Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...
Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...
 
sauth delhi call girls in Bhajanpura 🔝 9953056974 🔝 escort Service
sauth delhi call girls in Bhajanpura 🔝 9953056974 🔝 escort Servicesauth delhi call girls in Bhajanpura 🔝 9953056974 🔝 escort Service
sauth delhi call girls in Bhajanpura 🔝 9953056974 🔝 escort Service
 
Hi,Fi Call Girl In Mysore Road - 7001305949 | 24x7 Service Available Near Me
Hi,Fi Call Girl In Mysore Road - 7001305949 | 24x7 Service Available Near MeHi,Fi Call Girl In Mysore Road - 7001305949 | 24x7 Service Available Near Me
Hi,Fi Call Girl In Mysore Road - 7001305949 | 24x7 Service Available Near Me
 
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls ServiceKesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
 
VIP Call Girls Indore Kirti 💚😋 9256729539 🚀 Indore Escorts
VIP Call Girls Indore Kirti 💚😋  9256729539 🚀 Indore EscortsVIP Call Girls Indore Kirti 💚😋  9256729539 🚀 Indore Escorts
VIP Call Girls Indore Kirti 💚😋 9256729539 🚀 Indore Escorts
 
Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...
Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...
Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...
 

Big Data Training for Cancer Research, Purdue, May 2023

  • 1. An Overview of Cancer Data Repositories May 24, 2023 Warren A. Kibbe, Ph.D. Vice Chair & Professor of Biostatistics and Bioinformatics Duke University School of Medicine warren.kibbe@duke.edu
  • 2. We will take a tour of: AACR Project GENIE, NCI Genomic Data Commons, The Gabriella Miller Kids First Data Portal, dbGaP, The Human BioMolecular Atlas Program (HuBMAP), Human Tumor Atlas Network (HTAN), CIViC DB, MyCancerGenome, COSMIC, Childhood Cancer Data Catalog. By no means are these the only or even the most important ones!
  • 3. Our ability to generate biomedical data continues to grow in terms of variety and volume Current sources of data molecular genome pathology imaging labs notes sensors icons by the Noun Project
  • 4. Things to think about: Many different metaphors for building a data repository. Many different methods for making data FAIR (Findable, Accessible, Interoperable, Reusable). Data wrangling – organizing, formatting, harmonizing, semantically annotating – is hard work and different resources take different approaches. Usability is dependent on the use case, the tools, the problem domain.
  • 5. Design Considerations: WHERE DO YOU WANT FLEXIBILITY? FILE FORMATS? DATA TYPES? DEFINITIONS? HOW MUCH CAN YOU ANTICIPATE AT THE BEGINNING? WHERE IS IT LIKELY THAT YOUR REQUIREMENTS WILL CHANGE? WHO ARE YOUR USERS? BIOLOGISTS? CHEMISTS? CLINICIANS? QUANTS? WILL THEY COME TO EXPLORE (HYPOTHESIS GENERATION)? WILL THEY BRING AN EXISTING QUESTION (HYPOTHESIS ANSWERING)? DO THEY WANT TO USE THEIR OWN ANALYSIS TOOLS? DO THEY WANT TO BUILD NEW TOOLS TO EXPLORE THE DATA?
  • 6. What is the right data resource to use? What is the question you want to answer? Are you exploring how to answer the problem or ready to analyze? Do you need to explore a dataset to understand if the data is available? Does the resource support the kind of exploration you want to do? Do you have an analysis plan and workflow defined? Are your analysis tools appropriate for the resource you want to use?
  • 7. What is the right data resource to use? Does it have the right kind of data? (clinical trial data, scRNA-seq, specific cell line, disease area, model system) What are the access policies? (unrestricted, controlled access) Level of security required (none, limited, regulated [think HIPAA], very sensitive) Ability to peruse the metadata? Can you download the data? If it is an enclave/closed ecosystem, what tools are supported? Can you bring your own tools?
  • 8. A useful catalog of data resources https://datacatalog.ccdi.cancer.gov
  • 9. Childhood Cancer Data Catalog: TCIA projects
  • 10. A short scenario: You are interested in exploring non-small cell lung cancer (NSCLC) and identifying the top 5 mutations. You want to plot either progression-free survival or a survival curve for each mutation. Possible data sources: AACR Project GENIE and the NCI Genomic Data Commons (part of the NCI Cancer Research Data Commons)
  • 11. AACR Project GENIE cBioPortal resource https://genie.cbioportal.org Start with Cohort 13.1 – data from 167,358 samples
  • 12. AACR Project GENIE cBioPortal resource https://genie.cbioportal.org Select the 24,110 NSCLC
  • 13. AACR Project GENIE cBioPortal resource https://genie.cbioportal.org NSCLC view
  • 14. AACR Project GENIE cBioPortal resource https://genie.cbioportal.org Lung adenocarcinoma, 18,430 cases
  • 15. AACR Project GENIE cBioPortal resource https://genie.cbioportal.org ‘Manhattan’ plot of mutations in TP53
  • 16. AACR Project GENIE cBioPortal resource https://genie.cbioportal.org Not a survival analysis! Looks like patients with somatic TP53 mutations die younger. Many potential confounders!
  • 17. cBioPortal makes plotting survival data easy – but not with GENIE data! This plot is men vs women with bladder cancer https://cbioportal.org
  • 18. Now we will explore the NCI Genomic Data Commons https://portal.gdc.cancer.gov
  • 19. Genomic Data Commons https://portal.gdc.cancer.gov Lung cancer – 12,262 cases
  • 20. Genomic Data Commons https://portal.gdc.cancer.gov Lung cancer – 12,262 cases. Only 1,475 have survival data
  • 21. Genomic Data Commons https://portal.gdc.cancer.gov Genes view
  • 22. Genomic Data Commons https://portal.gdc.cancer.gov Genes view Notice the case numbers TP53
  • 23. Genomic Data Commons https://portal.gdc.cancer.gov Genes view Notice the case numbers TTN
  • 24. Genomic Data Commons https://portal.gdc.cancer.gov Genes view Notice the case numbers MUC16
  • 25. Genomic Data Commons https://portal.gdc.cancer.gov Genes view Notice the case numbers CSMD3
  • 26. Genomic Data Commons https://portal.gdc.cancer.gov Genes view Notice the case numbers RYR2
  • 27. Genomic Data Commons https://portal.gdc.cancer.gov Genes view Notice the case numbers USH2A
  • 28. Genomic Data Commons https://portal.gdc.cancer.gov Genes view Notice the case numbers LRP1B
  • 29. Genomic Data Commons https://portal.gdc.cancer.gov Genes view Notice the case numbers ZFHX4
  • 30. Genomic Data Commons https://portal.gdc.cancer.gov Genes view Notice the case numbers XIRP2
  • 31. Survival analysis out of the box: The fact that mutations in all of those genes appear to confer a survival advantage is puzzling and contradicts the simple longevity curve we got from cBioPortal in GENIE. There is a strange drop in survival at about 5 years. Need a lot more analysis before believing the plot!
  • 32. Revisit: AACR Project GENIE cBioPortal resource https://genie.cbioportal.org Not a survival analysis! Looks like patients with somatic TP53 mutations die younger. Many potential confounders!
  • 33. Survival analysis out of the box: The fact that mutations in all of those genes appear to confer a survival advantage is puzzling and contradicts the simple longevity curve we got from cBioPortal in GENIE. There are multiple potential explanations: (1) TP53 mutations arise in younger patients and their survival with disease is longer (maybe?); (2) there are one or more confounders in the TCGA data, such as a skew toward late-stage disease for patients with TP53 mutations (and the other mutations as well!) (definitely!); (3) patients are from non-comparable cohorts (probably). Other confounders you can think of?
  • 34. Survival analysis out of the box: These problems, and the need for consistent data entry across all cases, are why clinical trials are the standard for interpreting progression-free survival and performing survival analysis. This is a taste of some of the issues in interpreting real world data, such as knowing whether the cases and controls are balanced for known confounders. Knowing which questions are appropriate for a given dataset is critical.
  • 35. Confounders: A story about lung cancer and exercise. Sometimes we overlook the obvious. Don’t forget to include known confounders!
  • 36. Background on TCGA and GENIE: The Cancer Genome Atlas (TCGA) is a long-running study of all comers across many institutions; however, specific institutions contributed specimens and data predominantly from one disease site. TCGA is biased toward larger tumors and stage 3. TCGA was research sequencing, so not with therapeutic intent. TCGA is primarily research whole exome sequencing. GENIE is also an ‘all comers’ study. Sequence comes from a variety of in-house and commercial clinical sequencing platforms. Primarily targeted sequencing. Based on perceived clinical benefit for the patient.
  • 37. More on TCGA and GENIE: Follow-up, therapeutic lines of treatment, PFS, and recurrence are not consistently captured. Diagnosis and stage at specimen collection are fairly consistent.
  • 38. Good questions for TCGA and GENIE: Prevalence of mutations in a disease area. Mutational frequency at time of diagnosis. Mutational patterns at time of diagnosis. Age at diagnosis.
  • 39. Beyond TCGA and GENIE: The next generation repository will need to carefully capture lines of therapy, multiple disease measures, tumor evolution, and more detailed biomarker measurements like scRNAseq, proteomics, metabolomics, tumor microenvironment. For deep characterization of normal tissues: The Human BioMolecular Atlas Program (HuBMAP) https://commonfund.nih.gov/hubmap. For cancer progression from precancerous lesions through to advanced disease: The Human Tumor Atlas Network (HTAN) https://humantumoratlas.org HuBMAP and HTAN are laying the foundation for the next stage of repositories
  • 44. Data Access: For the example analyses, we used the public (open access) tier of data from Project GENIE and TCGA. For detailed sequence and clinical data you generally need to have restricted access. For NIH studies, restricted access data requires submitting a dbGaP access request.
  • 46. dbGaP – Access to Kids First data
  • 47. An auxiliary way to look at dbGaP projects. Also doesn’t help see variable names, but does list file types
  • 48. dbGaP and access to metadata: With the exception of the number of participants, things like the types of experimental variables (targeted sequencing, whole exome, whole genome, RNAseq, scRNAseq, methylSeq, etc.), clinical variables (diagnosis, stage, treatments, adverse events), and demographic variables (age, race, ethnicity, urban vs rural, etc.) are not available for many projects until you have access to the full dataset. This makes it hard to identify which specific datasets you might be interested in.
  • 49. To understand Kids First, go to it First! https://kidsfirstdrc.org
  • 50. Some stats about Kids First, even before creating a login https://kidsfirstdrc.org
  • 51. Kids First login or sign up: You can use an ORCID or a Google account
  • 52. You do have to accept the terms & conditions
  • 53. You can see the studies
  • 54. Drill down on the childhood cancer studies
  • 55. Select Neuroblastoma – 1681 participants
  • 56. Quick visual breakdown, including survival analysis
  • 57. Looking at variants, more than 280 million of them!
  • 58. Select gene, then enter TP53
  • 59. Now we see the variants for TP53
  • 60. Now select for pathogenic variants – 6 of them
  • 61. Add likely pathogenic – now we have 8 variants
  • 62. TP53 is not druggable: Let’s look at something clinically actionable. EGFR T790M is a common resistance mutation for patients with NSCLC getting tyrosine kinase inhibitors (TKIs) like gefitinib and erlotinib, which are small molecule targeted therapies. We are now going to start looking at EGFR.
  • 65. The COSMIC Database https://cancer.sanger.ac.uk/cosmic Very different purpose – it is to catalog somatic mutations associated with cancer
  • 69. COSMIC view of mutations in and around EGFR
  • 70. COSMIC view of mutations around EGFR
  • 71. COSMIC view of EGFR https://cancer.sanger.ac.uk/cosmic/gene/analysis?ln=EGFR
  • 75. Most common cancers with EGFR Mutations in GENIE
  • 76. Drugs in MyCancerGenome – selecting TKIs
  • 77. gefitinib in MyCancerGenome What about exclusion biomarkers like EGFR T790M?
  • 78. Going back to the NCI Genomic Data Commons https://portal.gdc.cancer.gov
  • 80. Look for EGFR: EGFR T790M mutations arise as resistance mutations to 1st or 2nd gen TKI therapy, so they are not common in patients/cancers who do not get these therapies
  • 82. CIViC DB Similar to ClinVar, ClinGen, MyCancerGenome
  • 84. CIViC DB – EGFR T790M: Mutation effect in the description is clear. Mutation effect on these drugs is not clear
  • 85. CIViC DB – erlotinib Mutation effect on erlotinib is not clear
  • 86. In conclusion: Many different metaphors for building a data repository. Many different methods for making data FAIR (Findable, Accessible, Interoperable, Reusable). Data wrangling – organizing, formatting, harmonizing, semantically annotating – is hard work and different resources take different approaches. Usability is dependent on the use case, the tools, the problem domain.
  • 87. What is the right data resource to use? What is the question you want to answer? Are you exploring how to answer the problem or ready to analyze? Do you need to explore a dataset to understand if the data is available? Does the resource support the kind of exploration you want to do? Do you have an analysis plan and workflow defined? Are your analysis tools appropriate for the resource you want to use?
  • 88. What is the right data resource to use? Does it have the right kind of data? (clinical trial data, scRNA-seq, specific cell line, disease area, model system) What are the access policies? (unrestricted, controlled access) Level of security required (none, limited, regulated [think HIPAA], very sensitive) Ability to peruse the metadata? Can you download the data? If it is an enclave/closed ecosystem, what tools are supported? Can you bring your own tools?
  • 89. Just ask yourself a few pertinent questions as you use them! There are many resources out there to use
  • 92. What is Real World Data? Collected in the context of patient care. Real World Data was called out as part of the 21st Century Cures Act 21st Century Cures Act: https://www.fda.gov/regulatory-information/selected-amendments-fdc-act/21st-century-cures-act Graphic from HealthCatalyst: https://www.healthcatalyst.com/insights/real-world-data-chief-driver-drug-development
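
The survival plots discussed in slides 31 to 34 can be illustrated with a small Kaplan-Meier calculation. Portal survival plots are typically Kaplan-Meier estimates, but the sketch below is not the code behind cBioPortal or the GDC. The function name `kaplan_meier` and the follow-up data are invented for illustration only and do not come from GENIE or TCGA.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with an observed death.

    durations: follow-up time for each patient (any consistent unit).
    events:    True if the patient died at that time, False if censored
               (lost to follow-up or still alive at last contact).
    """
    deaths = Counter(t for t, died in zip(durations, events) if died)
    exits = Counter(durations)  # everyone leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who survive time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= exits[t]
    return curve


if __name__ == "__main__":
    # Hypothetical follow-up times (months) and death indicators.
    months = [2, 3, 3, 5, 8, 8, 12]
    died = [True, True, False, True, False, True, False]
    for t, s in kaplan_meier(months, died):
        print(f"month {t}: estimated survival {s:.3f}")
```

Censored patients shrink the risk set without counting as deaths, so the inconsistent follow-up noted for TCGA and GENIE (slide 37) directly affects the shape of any curve computed this way.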

Editor's Notes

  1. Select non-small cell lung cancer
  2. Select non-small cell lung cancer. Copy number changes, genes with mutations, structural variants.
  3. Select non-small cell lung cancer mutations in TP53
  4. Select non-small cell lung cancer mutations in TP53
  5. Select non-small cell lung cancer mutations in TP53
  6. Select non-small cell lung cancer mutations in TP53
  7. Study was published without taking into account the distribution of smoking status, which led to an erroneous set of conclusions
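
The confounding problem in note 7 (and slide 35) can be made concrete with a small stratified comparison. This is a minimal sketch with invented counts: the numbers, the `counts` table, and the function names are hypothetical and are not taken from the study mentioned in the talk. They are chosen so that exercise looks protective in the pooled data even though the lung cancer rate is identical within each smoking stratum.

```python
# (smoking_status, exercises) -> (lung_cancer_cases, group_size)
# All counts are invented for illustration.
counts = {
    ("smoker", True): (20, 100),
    ("smoker", False): (80, 400),
    ("non-smoker", True): (8, 400),
    ("non-smoker", False): (2, 100),
}


def crude_rate(table, exercises):
    """Lung cancer rate for one exercise group, ignoring smoking status."""
    cases = sum(c for (_, ex), (c, _n) in table.items() if ex == exercises)
    total = sum(n for (_, ex), (_c, n) in table.items() if ex == exercises)
    return cases / total


def stratified_rates(table, exercises):
    """Lung cancer rate for one exercise group within each smoking stratum."""
    return {s: c / n for (s, ex), (c, n) in table.items() if ex == exercises}


if __name__ == "__main__":
    print("crude, exercisers:     ", crude_rate(counts, True))
    print("crude, non-exercisers: ", crude_rate(counts, False))
    print("by stratum, exercisers:     ", stratified_rates(counts, True))
    print("by stratum, non-exercisers: ", stratified_rates(counts, False))
```

In this toy table the pooled difference comes entirely from smokers being over-represented among non-exercisers, which is the kind of imbalance that checking the distribution of a known confounder is meant to catch.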