Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Genomics and Computation in Precision Medicine March 2017


Published on

Insights from Genomics and Computational Biology that inform precision medicine. Given at City of Hope Comprehensive Cancer Center

Published in: Health & Medicine

Genomics and Computation in Precision Medicine March 2017

  1. 1. Insights from Genomics and Computational Biology that inform Precision Medicine Warren Kibbe, PhD @wakibbe March 6th, 2017
  2. 2. 2 1. Background 2. Data in Biomedicine 3. Data Sharing 4. Data Commons 5. Genomics and Computation Thanks to many folks for slides, but especially Jerry Lee
  3. 3. 3 In 2016 there were an estimated 1,700,000 new cancer cases and 600,000 cancer deaths - American Cancer Society Cancer remains the second most common cause of death in the U.S. - Centers for Disease Control and Prevention
  4. 4. 4 In 2016 there are an estimated 15,500,000 cancer survivors in the U. S.
  5. 5. 5 Understanding Cancer  Precision medicine will lead to fundamental understanding of the complex interplay between genetics, epigenetics, nutrition, environment and clinical presentation and direct effective, evidence-based prevention and treatment.
  6. 6. 6 (10,000+ patient tumors and increasing) Courtesy of P. Kuhn (USC) 2006-2015: A Decade of Illuminating the Underlying Causes of Primary Untreated Tumors Omics Characterization Cancer is a grand challenge Deep biological understanding Advances in scientific methods Advances in instrumentation Advances in technology Data and computation Cancer Research and Care generate detailed data that is critical to create a learning health system for cancer Requires:
  7. 7. 7 (10,000+ patient tumors and increasing) Courtesy of P. Kuhn (USC) 2006-2015: A Decade of Illuminating the Underlying Causes of Primary Untreated Tumors Omics Characterization
  8. 8. The primary goal of TCGA is to identify disease subtypes and generate a comprehensive catalog of all cancer genes and pathways Pilot phase started in 2006 -- 3 tumor types 500 T/N pairs each Extended in 2009 -- 20 x 500 T/N + 15 x (50-150) T/N Freely available genomic data enable researchers anywhere around the world to make and validate important discoveries
  9. 9. 9
  10. 10. 10
  11. 11. 11
  12. 12. Biological Scales
  13. 13. 13 Tumor, Cancer, and Metastasis: (Length-scale and Time-scale Matter) “…>90% of deaths are caused by disseminated disease or metastasis…” Gupta et. al., Cell, 2006 and Siegel et. al. CA Cancer J Clin, Jan/Feb 2016 5 year Relative Survival Rates (2016 report of 2005-2011 data)
  14. 14. 14 1997 20152001 20051998 20142002 20061999 2003 20112000 2004 2007 20122008 20132009 2010 10/23/2001 (~5 yrs old) 4/21/1997 1/9/2007 (~10 yrs old) iPod (10GB max) WinAMP(mp3) iPhone (EDGE, 16 GB max) 9/16/1999 (~3 yrs old) 802.11b WiFi 4/3/2010 (~13 yrs old) iPad (EDGE, 64 GB max) 4/23/2005 (~8 yrs old) 9/26/2006 (~9 yrs old) 7/15/2006 2/7/2007 Google Drive 4/24/2012 (~15 yrs old)7/11/2008 (~11 yrs old) iPhone 3G (16 GB max) 9/12/2012 (~15 yrs old) iPhone5 (LTE, 128 GB max) Google Baseline 7/14/2014 (~17 yrs old) 3/9/2015 (~18 yrs old) Digital technology is changing rapidly
  15. 15. 15 $640M (FY74) $5.21 B (FY16) Cancer therapy is changing
  16. 16. 1618 Application of Cancer Genomics is changing
  17. 17. 17
  18. 18. 18
  19. 19. 19 Biology and Medicine are now data intensive enterprises Scale is rapidly changing Technology, data, computing and IT are pervasive in the lab, the clinic, in the home, and across the population
  20. 20. 20
  21. 21. 21
  22. 22. 22 “ advantage of machine learning is that it can be used even in cases where it is infeasible or difficult to write down explicit rules to solve a problem...”
  23. 23. 23
  24. 24. 24
  25. 25. 25 " there is great potential for new insights to come from the combined analysis of cancer proteomic and genomic data, as proteomic data can now reproducibly provide information about protein levels and activities that are difficult or impossible to infer from genomic data alone ” Douglas R. Lowy, MD Acting Director of the National Cancer Institute, National Institutes of Health 5/25/2016
  26. 26. 26 Genomics is the beginning of precision medicine, not the end!
  27. 27. 27 Plumbing, architecture, genomics
  28. 28. Cancer Research Data Ecosystem – Cancer Moonshot BRP Well characterized research data sets Cancer cohorts Patient data EHR, Lab Data, Imaging, PROs, Smart Devices, Decision Support Learning from every cancer patient Active research participation Research information donor Clinical Research Observational studies Proteogenomics Imaging data Clinical trials Discovery Patient engaged Research Surveillance Big Data Implementation research SEERGDC
  29. 29. Data Commons Framework
  30. 30. 30 Data Commons Structure DICOM, AIM Amazon Google IBM Imaging Validator Q/A Proteomic Validator Q/A Clinical Phenotype Validator Q/A MOD Phenotype Validator Q/A Pathology Radiology Mass Spectrometry Array Data Commons Security Visualization Authentication & Authorization Genomic Validator Q/A Germline Pipelines DNA, RNA Pipelines EMRs, Clinical Trials Azure Data Contributors and Consumers Researchers PatientsCliniciansInstitutions NCI Thesaurus caDSR NLM UMLS RxNorm LOINC SNOMED
  31. 31. 31 Cancer Data Sharing & Data Commons • Support open science • Support data reusability • Aligned with Cancer Moonshot • Part of Precision Medicine • Reduce Health Disparities • Improve patient access to clinical trials • Work toward a learning National Cancer Data Ecosystem Reduce the risk, improve early detection, outcomes and survivorship in cancer
  32. 32. Cancer Data Sharing Efforts Signature Efforts Data BRCA Challenge Somatic variant sharing Isolated genetic variants No raw sequencing data Precision medicine questions Somatic variant sharing Panel gene resequencing Clinical response Clinical trial Public-private partnerships Comprehensive genomics Detailed clinical phenotype data Clinical trial access Clinical/genomic data aggregation EHR data Clinical sequencing Clinical oncology standards EHR data Clinical sequencing
  33. 33. 33 Changing the conversation around data sharing  How do we find data, software, standards?  How can we make data, annotations, software, metadata accessible?  How do we reuse data standards?  How do we make more data machine readable? NIH Data Commons NCI Genomic Data Commons National Cancer Data Ecosystem Data Commons co-locate data, storage and computing infrastructure, and frequently used tools for analyzing and sharing data to create an interoperable resource for the research community. *Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson, A Case for Data Commons Towards Data Science as a Service, to appear. Source of image: Interior of one of Google’s Data Center,
  34. 34. 34 Basic Ingredients of a Cancer Research Data Ecosystem • Open Science. Supporting Open Access, Open Data, Open Source, and Data Liquidity for the cancer community • Standardization through CDEs and Case Report Forms • Interoperability by exposing existing knowledge through appropriate integration of ontologies, vocabularies and taxonomies • Consistency of curation and capture of evidence (tissue types, recurrence, therapy – including genomic, epigenomic, etc context) • Sustainable models for informatics infrastructure, services, data, metadata
  35. 35. 35 GDC as an example of a new architecture for storing and sharing cancer data
  36. 36. 36 The Cancer Genomic Data Commons (GDC) is an existing effort to standardize and simplify submission of genomic data to NCI and follow the principles of FAIR – Findable, Accessible, Attributable, Interoperable, Reusable, and Provide Recognition. The GDC is part of the NIH Big Data to Knowledge (BD2K) initiative and an example of the NIH Commons Genomic Data Commons Microattribution, nanopublications, tracking the use of data, annotation of data, use of algorithms, supports the data /software /metadata life cycle to provide credit and analyze impact of data, software, analytics, algorithm, curation and knowledge sharing Force11 white paper
  37. 37. 37 The NCI Genomic Data Commons • Unify fragmentary repositories • Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies • Harmonization of raw sequence both from existing and new cancer research programs • Application of state-of-the-art methods of generating derived genomic data • Provide the foundation for: • Identification of high- and low-frequency cancer drivers • Defining genomic determinants of response to therapy • Clinical trial cohorts sharing targeted genetic lesions
  38. 38. NCI Genomic Data Commons  The GDC went live on June 6, 2016 with approximately 4.1 PB of data.  577,878 files about 14194 cases (patients), in 42 cancer types, across 29 primary disease sites, 400 clinical data elements  10 major data types, ranging from Raw Sequencing Data, Raw Microarray Data, to Copy Number Variation, Simple Nucleotide Variation and Gene Expression.  Data are derived from 17 different experimental strategies, with the major ones being RNA-Seq, WXS, WGS, miRNA-Seq, Genotyping Array and Expression Array.  Foundation Medicine announced the release of 18,000 genomic profiles to the GDC at the Cancer Moonshot Summit, June 29th, 2016  The Multiple Myeloma Research Foundation announced it would be releasing its CoMMpass study of more than 1000 cases of Multiple Myeloma on Sept 29, 2016.
  39. 39. 39 Content in the Genomic Data Commons  TCGA 11,353 cases  TARGET 3,178 cases Current  Foundation Medicine 18,000 cases  Cancer studies in dbGaP ~4,000 cases  Multiple Myeloma RF ~1,000 cases Coming soon  NCI-MATCH ~3,000 cases  Clinical Trial Sequencing Program ~3,000 cases Planned (1-3 years)  Cancer Driver Discovery Program ~5,000 cases  Human Cancer Model Initiative ~1,000 cases  APOLLO – NCI, VA and DoD ~8,000 cases ~58,000 cases
  40. 40. Exome-seq Whole genome-seq RNA-seq Copy number Genome alignment Genome alignment Genome alignment Data segmentation 1° processing Mutations Mutations + structural variants Digital gene expression Copy number calls 2° processing Oncogene vs. Tumor suppressor Translocations Relative RNA levels Alternative splicing Gene amplification/ deletion 3° processing GDC Data Harmonization Multiple data types and levels of processing
  41. 41. Mutect2 pipeline GDC Data Harmonization Open Source, Dockerized Pipelines
  42. 42. Recovery rate (% true positives) A0F0 SomaticSniper 81.1% 76.5% VarScan 93.9% 84.3% MuSE 93.1% 87.3% All Three 96.4% 91.2% GDC variant calling pipelines Wash U Baylor Broad GDC Data Harmonization Multiple pipelines needed to recover all variants
  43. 43. Development of the NCI Genomic Data Commons (GDC) To Foster the Molecular Diagnosis and Treatment of Cancer GDC Bob Grossman PI Univ. of Chicago Ontario Inst. Cancer Res. Leidos Institute of Medicine Towards Precision Medicine 2011
  44. 44. Discovery of Cancer Drivers With 2% Prevalence Lung adeno. + 2,900 Colorectal + 1,200 Ovarian + 500 Lawrence et al, Nature 2014 Power Calculation for Cancer Driver Discovery Need to resequence >100,000 tumors to identify all cancer drivers at >2% prevalence
  45. 45. The NCI Cancer Genomics Cloud Pilots Understanding how to meet the research community’s need to analyze large-scale cancer genomic and clinical data
  46. 46. 48 NCI Cancer Genomics Cloud Pilots Democratize access to NCI-generated genomic and related data, and to create a cost-effective way to provide scalable computational capacity to the cancer research community. Cloud Pilots provide: • Access to large genomic data sets without need to download • Access to popular pipelines and visualization tools • Ability for researchers to bring their own tools and pipelines to the data • Ability for researchers to bring their own data and analyze in combination with existing genomic data • Workspaces, for researchers to save and share their data and results of analyses
  47. 47. 49 • PI: Gad Getz • Google Cloud • Firehose in the cloud including Broad best practices workflows • Broad Institute • PI: Ilya Shmulevich • Google Cloud • Leverage Google infrastructure; Novel query and visualization • Institute for Systems Biology • PI: Deniz Kural • Amazon Web Services • Interactive data exploration; > 30 public pipelines • Seven Bridges Genomics Three NCI Genomics Cloud Pilots Selection Design/Build I Design/Build II Evaluation Extension Sept 2016Jan 2016April 2015Sept 2014 Jan 2014
  48. 48. Broad Institute Cloud Pilot • Targeted at users performing analyses at scale • Modeled after their Firehose analysis infrastructure developed for the TCGA program • Users can upload their own data and tools and/or run the Broad’s best practice tools and pipelines on pre-loaded data
  49. 49. Institute for Systems Biology Cloud Pilot 51 PI / Biologist web access Computational Research Scientist Python, R, SQL Algorithm Developer ssh, programmatic access ISB-CGC Web App Google Cloud Console Google APIs ISB-CGC APIs Compute Engine VMs Cloud Storage BigQuery Genomics Local Storage ISB-CGC Hosted Data Controlled-Access Data Open-Access Data User Data • Closely tied with Google Cloud Platform tools including BigQuery, App Engine, Cloud Datalab, Google Genomics, and Compute Engine • Aggregated TCGA data in BigQuery allows fast SQL-like queries across the entire dataset • Web interface allows scientists to interactively compare and define cohorts
  50. 50. Seven Bridges Genomics Cloud Pilot • Built upon the SBG commercial cloud-based genomics platform • Graphical query interface to identify hosted data of interest • Includes a native implementation of the Common Workflow Language specification and graphical interface for creating user- defined workflows
  51. 51. Workspace – isolated environment for collaborative analysis Data + Methods → Results sample data and metadata (e.g. BAMs, tissue type) algorithms (e.g. mutation calling) Wiring logic (e.g. use the exome capture BAM) executions and results (e.g. run mutation caller v41 on this exact bam and track results) Slide courtesy of Broad Institute
  52. 52. GDC Acknowledgements NCI Center for Cancer Genomics Univ. of Chicago Bob Grossman Allison Heath Mike Ford Zhenyu Zhang Ontario Institute for Cancer Research Lou Staudt Zhining Wang Martin Ferguson JC Zenklusen Daniela Gerhard Deb Steverson Vincent Ferretti 'Francois Gerthoffert JunJun Zhang Leidos Biomedical Research Mark Jensen Sharon Gaheen Himanso Sahni NCI NCI CBIIT Tony Kerlavage Tanya Davidsen
  53. 53. CGC Pilot Team Principal Investigators • Gad Getz, Ph.D - Broad Institute - • Ilya Shmulevich, Ph.D - ISB - • Deniz Kural, Ph.D - Seven Bridges – NCI Project Officer & CORs • Anthony Kerlavage, Ph.D –Project Officer • Juli Klemm, Ph.D – COR, Broad Institute • Tanja Davidsen, Ph.D – COR, Institute for Systems Biology • Ishwar Chandramouliswaran, MS, MBA – COR, Seven Bridges Genomics GDC Principal Investigator • Robert Grossman, Ph.D - University of Chicago • Allison Heath, Ph.D - University of Chicago • Vincent Ferretti, Ph.D - Ontario Institute for Cancer Research Cancer Genomics Project Teams NCI Leadership Team • Doug Lowy, M.D. • Lou Staudt, M.D., Ph.D. • Stephen Chanock, M.D. • George Komatsoulis, Ph.D. • Warren Kibbe, Ph.D. Center for Cancer Genomics Partners • JC Zenklusen, Ph.D. • Daniela Gerhard, Ph.D. • Zhining Wang, Ph.D. • Liming Yang, Ph.D. • Martin Ferguson, Ph.D.
  54. 54. 56 Integrated data sets, interoperable resources, harmonized data are necessary for and enable biologically informed cancer predictive models
  55. 55. 57 Questions? Warren Kibbe, Ph.D. @wakibbe
  56. 56.
  57. 57. 59 NIH Genomic Data Sharing Policy Went into effect January 25, 2015 NCI guidance: management/nci-policies/genomic-data Requires public sharing of genomic data sets