Facilitating target candidate prioritization via integrated, interactive visualizations of molecular profiling data
Wolfgang Hoeck, Ph.D., Research Informatics, Amgen Inc.
Topics for today’s presentation
• What is Molecular Profiling Data?
• The problem of sharing large volume data
  – Sending files isn’t working well
• Public molecular profiling efforts
  – The Cancer Genome Atlas
  – Sanger COSMIC
  – Broad CCLE
• TARO - an integrated database plus interactive visualizations
  – Identities and Standard Terminologies (Taxonomies)
• Commercial molecular profiling data repositories
• Leveraging internal and external efforts
• Pulling everything together
• Closing thoughts
2/5/2012 Wolfgang Hoeck 2
Molecular Profiling Data as a source of potential Targets
• What is Molecular Profiling Data?
  – High volume data (millions of data points) measuring genomic or transcriptomic end points
• Gene Expression: How much of my gene is expressed under a certain condition?
  – Comparing gene expression of two groups: Normal/Tumor or Tumor/Tumor
  – Surveying a panel of normal tissues
• Gene Copy Number: How many copies of my gene are present in the genome?
  – Which genes are contained in an amplified region of a chromosome?
  – Is a gene or gene family amplified or deleted in a given tumor setting?
  – Can we validate the copy number status in an independent dataset?
• Somatic Mutations: Is my gene normal or mutated?
  – Is the gene clearly mutated or is there conflicting evidence?
  – Are mutations affecting genes in the same pathway?
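The Normal/Tumor comparison above is often summarized as a log2 fold change. The following is a minimal sketch of that calculation, not the method used in TARO; the function name, the pseudocount and the sample values are all assumptions made for illustration.

```python
import math
from statistics import mean

def log2_fold_change(tumor, normal, pseudocount=1.0):
    """Return log2(mean tumor / mean normal); the pseudocount avoids log(0)."""
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))

# Hypothetical linear-scale expression values for one gene (not real data).
tumor_samples = [30.0, 32.0, 31.0]
normal_samples = [6.0, 8.0, 7.0]

lfc = log2_fold_change(tumor_samples, normal_samples)
print(f"log2 fold change (tumor vs. normal): {lfc:.2f}")
```

A positive value indicates higher average expression in the tumor group. A real analysis would also need a statistical test and multiple-testing correction, which this sketch leaves out.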
Multiple Genomic Data Types lead to a list of possible targets
[Diagram: a List of Targets (Target Classes) is scored per data type and turned into prioritized target lists]
• Technologies shown: Microarray, RNA-seq, CGH Array, SNP Array, Exome Sequencing, RNA-seq, ChIP-seq, ChIP-chip
• Derived scores: Gene Expression Scores, Gene Copy Number Scores, Gene Mutation Scores, Gene Fusion Scores, Methylation Scores
• Outputs: Prioritized Target List #1, Prioritized Target List #2
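The slide does not say how per-data-type scores are combined into a prioritized list. One simple possibility, shown here purely as an assumption, is a weighted sum, where different weightings yield different prioritized lists. The gene names, scores and weights below are hypothetical.

```python
# Hypothetical per-gene scores in [0, 1] for each data type (not from the talk).
scores = {
    "GENE_A": {"expression": 0.9, "copy_number": 0.8, "mutation": 0.1},
    "GENE_B": {"expression": 0.4, "copy_number": 0.2, "mutation": 0.9},
    "GENE_C": {"expression": 0.2, "copy_number": 0.1, "mutation": 0.0},
}

def prioritize(scores, weights):
    """Rank genes by the weighted sum of their per-data-type scores."""
    def total(gene):
        return sum(weights.get(kind, 0.0) * value
                   for kind, value in scores[gene].items())
    return sorted(scores, key=total, reverse=True)

# Two weightings, standing in for "Prioritized Target List #1" and "#2".
list_1 = prioritize(scores, {"expression": 1.0, "copy_number": 1.0, "mutation": 0.0})
list_2 = prioritize(scores, {"expression": 0.0, "copy_number": 0.0, "mutation": 1.0})
print(list_1)
print(list_2)
```

Keeping the weights separate from the scores makes it cheap to produce several rankings from the same underlying data.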
Public and Commercial Molecular Profiling Efforts
Source | Name | Content | Data Type | Value
NCI | The Cancer Genome Atlas (TCGA) | 20+ tumor types, 500+ samples each | Gene Expression (uA, NGS), Copy Number, Clinical Data | Target Identification & Validation, Patient Stratification
Sanger / Wellcome Trust | Cancer Genome Project (CGP) | COSMIC | Somatic Mutation Data | Target Identification & Validation, Patient Stratification, Model Selection
Broad Institute | Cancer Cell Line Encyclopedia (CCLE) | 800+ Cancer Cell Lines | Gene Expression (uA), Copy Number (uA) | Target Identification & Validation, Model Selection
GSK-caBIG | Wooster Cell Line Panel | 300+ Cancer Cell Lines | Gene Expression (uA), Copy Number (uA) | Target Identification & Validation, Model Selection
RICERCA | OncoPanel | 240 Cancer Cell Lines | Gene Expression (uA) | Target Identification & Validation, Model Selection
TARO Data Sharing Solution Strategy
• Data type focused
  – Gene Expression, Copy Number and Somatic Mutations
• Technology Independent
  – Data from Microarray, NextGen Sequencing, Sanger Sequencing, etc.
• Source Independent
  – Data comes from multiple sources: Amgen, TCGA, Broad, Sanger, Publications
• Data Standardization enables integration at multiple Levels
  – Gene, Tissue, Disease, Sample (Tissue Sample / Cell Line Sample)
• Modular Development
  – Independent Database
  – Support Multiple User Interfaces
    – Visualization UI
    – Central Research Discovery Tool
    – Web Services
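To illustrate why standardization enables integration, here is a minimal sketch in which records from two sources only line up after gene and cell line names are mapped to canonical forms. The alias tables, the canonical spelling "SKBR3", the field names and the measurement values are assumptions for illustration and do not describe the TARO schema.

```python
# Hypothetical alias tables; a real system would use curated reference data.
GENE_ALIASES = {"HER2": "ERBB2", "NEU": "ERBB2", "ERBB2": "ERBB2"}
CELL_LINE_ALIASES = {"SK-BR-3": "SKBR3", "SK BR 3": "SKBR3", "SKBR3": "SKBR3"}

def standardize(record):
    """Return a copy of the record with canonical gene and cell line names."""
    return {
        **record,
        "gene": GENE_ALIASES[record["gene"].upper()],
        "cell_line": CELL_LINE_ALIASES[record["cell_line"].upper()],
    }

def integrate(*sources):
    """Merge records from several sources, keyed on (gene, cell_line)."""
    merged = {}
    for source in sources:
        for record in map(standardize, source):
            key = (record["gene"], record["cell_line"])
            measurements = {k: v for k, v in record.items()
                            if k not in ("gene", "cell_line")}
            merged.setdefault(key, {}).update(measurements)
    return merged

# Two hypothetical sources that name the same gene and cell line differently.
expression_source = [{"gene": "HER2", "cell_line": "SK-BR-3", "expression": 12.4}]
copy_number_source = [{"gene": "ERBB2", "cell_line": "SKBR3", "copy_number": 8}]

integrated = integrate(expression_source, copy_number_source)
print(integrated)
```

Without the alias lookup the two records would land under different keys and never be joined, which is the failure mode standard identifiers are meant to prevent.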
TARO Use Cases: Target Identification, Target Validation, Model Selection
• Target Identification:
  – Systematically identify targets via differential expression and/or copy number in one or multiple tissue datasets
• Target Validation:
  – Validate target expression in independent tissue datasets
  – Verify target expression across many normal and diseased tissue types to determine tissue specificity and potential off-target effects
• Model Selection:
  – Identify a cell line model that expresses the target of interest at a high or low level
  – Identify a cell line model that contains a target gene amplification
  – Provide mutation data on typical genes within selected cell lines to highlight the mutational background
  – Identify cell lines with a specific mutation pattern (e.g., EGFR mut and KRAS wt)
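The last Model Selection use case, finding cell lines with a pattern such as EGFR mut and KRAS wt, can be sketched as a simple filter. The cell line names, the status values and the function are hypothetical; how TARO implements this query is not described in the slides.

```python
# Hypothetical mutation calls per cell line: gene -> "mut" or "wt".
mutation_status = {
    "LINE_1": {"EGFR": "mut", "KRAS": "wt"},
    "LINE_2": {"EGFR": "wt", "KRAS": "mut"},
    "LINE_3": {"EGFR": "mut", "KRAS": "mut"},
    "LINE_4": {"EGFR": "mut"},  # KRAS not profiled in this line
}

def select_cell_lines(status_by_line, pattern):
    """Return cell lines whose calls match every (gene, status) pair in pattern.

    Lines with no call for a requested gene are excluded rather than
    assumed to be wild type.
    """
    return sorted(
        line for line, calls in status_by_line.items()
        if all(calls.get(gene) == status for gene, status in pattern.items())
    )

matches = select_cell_lines(mutation_status, {"EGFR": "mut", "KRAS": "wt"})
print(matches)
```

Excluding unprofiled lines is a deliberate choice in this sketch: treating missing data as wild type would risk selecting a model whose KRAS status is simply unknown.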
Layering the Information Landscape
[Diagram: four layers, listed from top to bottom]
• Decision Support (Research Gateway, TARO-Guides): Query tools for Amgen scientists to search across internal and external data repositories
• Convergence (Omics Repository, TARO Data Mart): Centralizes and organizes the storage of ‘Omics data for bioinformaticists and biologists alike
  – Fed from the layer below via Summarize, Aggregate, Normalize
• Transactional (Experiments, Omics Analysis): Operational systems to handle the day-to-day execution of ‘Omics experiments and their initial analysis
• Reference Data (Tissue, Disease, Organism, Cell Line, Gene Index; Research Taxonomy Foundation (RTF)): Fulfills the baseline requirements for biology identity / reference data systems. Without these systems none of the above is possible.
Take it apart, standardize, then connect and integrate …
TARO Guide Collection – covering the spectrum from summaries to details
Interactive Visualizations in Spotfire Client or Webplayer
• Gene Expression
  – Gene-level or probe-set level
  – Panels or Comparisons
• Copy Number
  – Whole chromosome view
  – Detail per sample
• Somatic Mutations
  – 1700+ cancer cell lines
  – COSMIC and other mutation data
Surveying the mutation landscape in Cancer Cell Lines
[Figure callouts: Standard Gene Symbols; Standard Canonical Cell Line Name; Standard Mutation Nomenclature]
Integrating Expression and Mutation Data
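The visualization on this slide is not recoverable from the transcript. As a rough illustration of what combining the two data types can look like, the sketch below averages the expression of one gene separately for mutant and wild-type cell lines. All names and values are hypothetical, and this is not a description of the Spotfire guide itself.

```python
from statistics import mean

# Hypothetical data: expression of one gene and mutation status of another,
# keyed by standardized cell line name.
expression = {"LINE_1": 8.0, "LINE_2": 4.0, "LINE_3": 10.0, "LINE_4": 2.0}
mutated = {"LINE_1": True, "LINE_2": False, "LINE_3": True, "LINE_5": True}

def expression_by_mutation_status(expression, mutated):
    """Average expression in mutant vs. wild-type lines found in both datasets."""
    groups = {"mut": [], "wt": []}
    for line in expression.keys() & mutated.keys():
        groups["mut" if mutated[line] else "wt"].append(expression[line])
    return {status: mean(values) for status, values in groups.items() if values}

summary = expression_by_mutation_status(expression, mutated)
print(summary)
```

Only cell lines present in both datasets contribute, which is why shared, standardized cell line names matter for this kind of join.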
Successes and Shortcomings of TARO
• Ideal for pointed questions
  – Show me the expression, copy number and mutation status of Gene X
  – Generate a list of differentially expressed genes for upload into NextBio
  – Identify cell lines with a particular mutation profile
• Great for data important to Amgen
• Provides a foundation for accumulating knowledge
• Shortcomings
  – Breadth of data is resource-limited
  – Data isn’t always available immediately; curation takes time
  – Complexity of data space, capability vs. simplicity
  – There is still some learning involved for scientists
  – Chosen technology doesn’t always allow the desired User Interface
Commercial Molecular Profiling Data Repositories
• Oncomine and Oncomine Power Tools (OPT)
  – Organizing and annotating oncology data in a consistent fashion
  – Oncomine Enterprise: Web user interface, enabling customer data uploads
  – OPT: Integrated Gene Browser, bringing multiple data types together in a summary view
• NextBio
  – NextBio Enterprise: Web user interface, enabling customer data uploads
  – Multiple Apps for a variety of profiling data
  – Includes literature data
  – Provides Meta Analysis: Surveying studies across multiple sources
Integrated Gene Browser – Oncomine Power Tools
BodyAtlas Cell Lines – NextBio
Where do we go from here?
• Why do this in the first place?
  – Better informed decisions
  – Achieve higher throughput, consider more targets
  – Help in understanding the complexity of the landscape
• We are starting to see the fruits of “semantic integration efforts”
  – Ad-hoc integration with stand-alone profiling data of different data types becomes much easier (e.g., Phosphoprotein Arrays)
  – Utilization of other public profiling datasets is easier (e.g., from publications)
  – Migrating into the “screening data” space (e.g., compound-treated cell line panels) now becomes possible
• In-House Challenges: Domain knowledge for curation, presentation of complex data in limited space, Database Performance – can we make it good enough?
• Vendor Challenges: Interfaces for Integration,
• Balance knowledge management efforts: Are we just data collectors?
But wait, there is more ….
Acknowledgements
• Interdisciplinary team work
  – Database Designers
  – Database Administrators
  – System Administrators
  – Business Analysts
  – Scientists
  – Bioinformaticists
  – Support Analysts
  – Project Manager
NONE OF THIS WOULD BE POSSIBLE WITHOUT TEAM WORK
THANK YOU FOR YOUR TIME