1. Facilitating target candidate prioritization via integrated,
interactive visualizations of molecular profiling data
Wolfgang Hoeck, Ph.D., Research Informatics, Amgen Inc.
2. Topics for today’s presentation
• What is Molecular Profiling Data?
• The problem of sharing large-volume data
  • Sending files isn’t working well
• Public molecular profiling efforts
  • The Cancer Genome Atlas
  • Sanger COSMIC
  • Broad CCLE
• TARO - an integrated database plus interactive visualizations
  • Identities and Standard Terminologies (Taxonomies)
• Commercial molecular profiling data repositories
• Leveraging internal and external efforts
• Pulling everything together
• Closing thoughts
2/5/2012 Wolfgang Hoeck 2
3. Molecular Profiling Data as a source of potential Targets
• What is Molecular Profiling Data?
  • High-volume data (millions of data points) measuring genomic or transcriptomic endpoints
  • Gene Expression: How much of my gene is expressed under a certain condition?
    • Comparing gene expression of two groups – Normal/Tumor or Tumor/Tumor
    • Surveying a panel of normal tissues
  • Gene Copy Number: How many copies of my gene are present in the genome?
    • Which genes are contained in an amplified region of a chromosome?
    • Is a gene or gene family amplified or deleted in a given tumor setting?
    • Can we validate the copy number status in an independent dataset?
  • Somatic Mutations: Is my gene normal or mutated?
    • Is the gene clearly mutated or is there conflicting evidence?
    • Are mutations affecting genes in the same pathway?
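The Normal/Tumor comparison listed above is commonly summarized per gene as a log2 fold change. The following is a minimal sketch of that calculation, not a description of how TARO computes it; the function name, the pseudocount, and the sample values are all hypothetical.

```python
import math
from statistics import mean

def log2_fold_change(tumor, normal, pseudocount=1.0):
    """Return log2 of (mean tumor / mean normal).

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a gene is not detected in one of the groups.
    """
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))

# Hypothetical expression values for one gene, in arbitrary units
tumor_samples = [31.0, 29.0, 33.0]
normal_samples = [7.0, 6.0, 8.0]

lfc = log2_fold_change(tumor_samples, normal_samples)
print(f"log2 fold change: {lfc:.2f}")
```

A positive value means higher expression in the tumor group, a negative value means lower expression. Real analyses would add a statistical test across samples, which this sketch leaves out.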
4. Multiple Genomic Data Types lead to a list of possible targets
List of Targets (Target Classes)
• Technologies: Microarray, RNA-seq, CGH Array, SNP Array, Exome Sequencing, RNA-seq, ChIP-seq, ChIP-chip
• Data types: Gene Expression, Gene Copy Number, Gene Mutation, Gene Fusion, Methylation
• Scores: one set per data type
• Prioritized Target List #1, Prioritized Target List #2
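One way to get from per-data-type scores to a prioritized target list is a weighted sum per gene. The sketch below assumes that approach purely for illustration; the slide does not say how the scores are combined, and the gene names, scores, and weights are hypothetical.

```python
from collections import defaultdict

def prioritize(scores_by_type, weights):
    """Combine per-data-type gene scores into one ranked list (weighted sum)."""
    totals = defaultdict(float)
    for data_type, gene_scores in scores_by_type.items():
        weight = weights.get(data_type, 0.0)  # unknown data types contribute nothing
        for gene, score in gene_scores.items():
            totals[gene] += weight * score
    # Highest combined score first; ties are broken alphabetically
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical scores, one dictionary per data type
scores_by_type = {
    "expression": {"GENE_A": 0.75, "GENE_B": 0.25},
    "copy_number": {"GENE_A": 0.5, "GENE_C": 1.0},
    "mutation": {"GENE_B": 0.25},
}
# Hypothetical weights; a different weighting yields a different prioritized list
weights = {"expression": 1.0, "copy_number": 1.0, "mutation": 2.0}

ranked = prioritize(scores_by_type, weights)
for gene, total in ranked:
    print(gene, total)
```

Changing the weights is one simple way to produce alternative lists such as "Prioritized Target List #1" and "#2" from the same underlying scores.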
5. Public and Commercial Molecular Profiling Efforts
Source | Name | Content | Data Type | Value
NCI | The Cancer Genome Atlas (TCGA) | 20+ tumor types, 500+ samples each | Gene Expression (uA, NGS), Copy Number, Clinical Data | Target Identification & Validation, Patient Stratification
Sanger Wellcome Trust | Cancer Genome Project (CGP) | COSMIC | Somatic Mutation Data | Target Identification & Validation, Patient Stratification, Model Selection
Broad Institute | Cancer Cell Line Encyclopedia (CCLE) | 800+ Cancer Cell Lines | Gene Expression (uA), Copy Number (uA) | Target Identification & Validation, Model Selection
GSK-caBIG | Wooster Cell Line Panel | 300+ Cancer Cell Lines | Gene Expression (uA), Copy Number (uA) | Target Identification & Validation, Model Selection
RICERCA | OncoPanel | 240 Cancer Cell Lines | Gene Expression (uA) | Target Identification & Validation, Model Selection
6. TARO Data Sharing Solution Strategy
• Data type focused
  • Gene Expression, Copy Number and Somatic Mutations
• Technology Independent
  • Data from Microarray, NextGen Sequencing, Sanger Sequencing, etc.
• Source Independent
  • Data comes from multiple sources: Amgen, TCGA, Broad, Sanger, Publications
• Data Standardization enables integration at multiple levels
  • Gene, Tissue, Disease, Sample (Tissue Sample / Cell Line Sample)
• Modular Development
  • Independent Database
  • Support Multiple User Interfaces
    • Visualization UI
    • Central Research Discovery Tool
    • Web Services
7. TARO Use Cases
Target Identification
Target Validation
Model Selection
• Target Identification:
  • Systematically identify targets via differential expression and/or copy number in one or multiple tissue datasets
• Target Validation:
  • Validate target expression in independent tissue datasets
  • Verify target expression across many normal and diseased tissue types to determine tissue specificity and potential off-target effects
• Model Selection:
  • Identify cell line models that express the target of interest at high or low levels
  • Identify cell line models that contain a target gene amplification
  • Provide mutation data on typical genes within selected cell lines to highlight the mutational background
  • Identify cell lines with a specific mutation pattern (e.g., EGFR mutant and KRAS wild-type)
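The last Model Selection use case, finding cell lines that are EGFR mutant and KRAS wild-type, reduces to a set filter once mutation calls are stored per cell line. The sketch below is a minimal illustration under that assumption; the cell line names and mutation calls are hypothetical, and it treats "no mutation call" as wild-type, which is only safe if the gene was actually profiled in that cell line.

```python
def matches_pattern(mutated_genes, must_be_mutated, must_be_wild_type):
    """True if all required genes are mutated and none of the excluded genes are."""
    mutated = set(mutated_genes)
    return set(must_be_mutated) <= mutated and mutated.isdisjoint(must_be_wild_type)

# Hypothetical cell lines mapped to the genes with a mutation call
cell_line_mutations = {
    "LINE_1": {"EGFR", "TP53"},
    "LINE_2": {"EGFR", "KRAS"},
    "LINE_3": {"KRAS"},
    "LINE_4": {"TP53"},
}

# EGFR mutant and KRAS wild-type
hits = sorted(
    name
    for name, genes in cell_line_mutations.items()
    if matches_pattern(genes, must_be_mutated={"EGFR"}, must_be_wild_type={"KRAS"})
)
print(hits)  # ['LINE_1']
```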
8. Layering the Information Landscape
Decision Support: Research Gateway, TARO-Guides
• Query tools for Amgen scientists to search across internal and external data repositories
Convergence: Omics Repository, TARO Data Mart
• Centralizes and organizes the storage of ‘Omics data for bioinformaticists and biologists alike
Transactional: Experiments, Omics Analysis
• Operational systems to handle the day-to-day execution of ‘Omics experiments and their initial analysis
• Diagram labels between layers: Summarize, Aggregate, Normalize
Reference Data: Tissue, Disease, Organism, Cell Line, Gene Index, Research Taxonomy Foundation (RTF)
• Fulfills the baseline requirements for biology identity / reference data systems. W/o these systems none of the above is possible.
9. Take it apart, standardize, then connect and integrate …
10. TARO Guide Collection – covering the spectrum from summaries
to details
Interactive Visualizations in Spotfire Client or Webplayer
• Gene Expression
  – Gene-level or probe-set level
  – Panels or Comparisons
• Copy Number
  – Whole chromosome view
  – Detail per sample
• Somatic Mutations
  – 1700+ cancer cell lines
  – COSMIC and other mutation data
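Offering both gene-level and probe-set-level views implies some rule for collapsing several probe sets onto one gene. The sketch below shows one common convention, keeping the probe set with the highest mean signal per gene. This is an assumption for illustration, not a statement of what TARO does; the probe-set identifiers, gene names, and values are hypothetical.

```python
from statistics import mean

def collapse_to_gene_level(probe_values, probe_to_gene):
    """For each gene, keep the values of the probe set with the highest mean signal."""
    best = {}  # gene -> (probe set id, mean signal)
    for probe, values in probe_values.items():
        gene = probe_to_gene[probe]
        avg = mean(values)
        if gene not in best or avg > best[gene][1]:
            best[gene] = (probe, avg)
    return {gene: probe_values[probe] for gene, (probe, _) in best.items()}

# Hypothetical probe-set signals across two samples
probe_values = {
    "probe_1": [2.0, 4.0],
    "probe_2": [6.0, 8.0],
    "probe_3": [1.0, 3.0],
}
# Hypothetical probe-set to gene mapping
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}

gene_level = collapse_to_gene_level(probe_values, probe_to_gene)
print(gene_level)
```

Other conventions, such as averaging all probe sets of a gene, are equally plausible; the choice affects what a gene-level panel shows.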
11. Surveying the mutation landscape in Cancer Cell Lines
Standard Gene Symbols
Standard Canonical Cell Line Name
Standard Mutation Nomenclature
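Standard names matter because the same cell line is spelled differently across sources. The sketch below shows one simple way to map spelling variants onto a canonical cell line name: reduce each name to a comparison key, then look it up in a synonym table. The helper names and the small table are illustrative assumptions, not TARO's actual curation logic.

```python
import re

def normalize_key(name):
    """Reduce a free-text name to a comparison key: uppercase, alphanumerics only."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())

# Illustrative synonym table: comparison key -> canonical cell line name
CANONICAL = {
    normalize_key("NCI-H1975"): "NCI-H1975",
    normalize_key("H1975"): "NCI-H1975",
}

def canonical_cell_line(name):
    """Return the canonical name, or None if the name is not in the table."""
    return CANONICAL.get(normalize_key(name))

for raw in ["nci h1975", "H-1975", "NCI-H1975", "unknown line"]:
    print(raw, "->", canonical_cell_line(raw))
```

Names that return None would go to manual curation rather than being guessed, and the same pattern applies to gene symbols.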
13. Successes and Shortcomings of TARO
• Ideal for pointed questions
  • Show me the expression, copy number and mutation status of Gene X
  • Generate a list of differentially expressed genes for upload into NextBio
  • Identify cell lines with a particular mutation profile
• Great for data important to Amgen
• Provides a foundation for accumulating knowledge
• Shortcomings
  • Breadth of data is resource-limited
  • Data isn’t always available immediately, curation takes time
  • Complexity of data space, capability vs. simplicity
  • There is still some learning involved for scientists
  • Chosen technology doesn’t always allow the desired User Interface
14. Commercial Molecular Profiling Data Repositories
• Oncomine and Oncomine Power Tools (OPT)
  • Organizing and annotating oncology data in a consistent fashion
  • Oncomine Enterprise: Web user interface, enabling customer data uploads
  • OPT: Integrated Gene Browser - Bringing multiple data types together in a summary view
• NextBio
  • NextBio Enterprise: Web user interface, enabling customer data uploads
  • Multiple Apps for a variety of profiling data
  • Includes literature data
  • Provides Meta Analysis: Surveying studies across multiple sources
17. Where do we go from here?
• Why do this in the first place?
  • Better informed decisions
  • Achieve higher throughput, consider more targets
  • Help in understanding the complexity of the landscape
• We are starting to see the fruits of “semantic integration efforts”
  • Ad-hoc integration with stand-alone profiling data of different data types becomes much easier (e.g., Phosphoprotein Arrays)
  • Utilization of other public profiling datasets is easier (e.g., from publications)
  • Migrating into the “screening data” space (e.g., compound-treated cell line panels) now becomes possible
• In-House Challenges: Domain knowledge for curation, presentation of complex data in limited space, Database Performance – can we make it good enough?
• Vendor Challenges: Interfaces for Integration
• Balance knowledge management efforts: Are we just data collectors? But wait, there is more …
18. Acknowledgements
• Interdisciplinary team work
• Database Designers
• Database Administrators
• System Administrators
• Business Analysts
• Scientists
• Bioinformaticists
• Support Analysts
• Project Manager
NONE OF THIS WOULD BE POSSIBLE WITHOUT TEAMWORK
19. THANK YOU FOR YOUR TIME