A foggy affair – sifting through
the 'omics data flood
Wolfgang Hoeck
Principal Business Analyst – 09/23/2010

                                R&DI - Research & Development Informatics
Today's Presentation

• Terms and Domain Definitions
• The “problem space” from a scientist's perspective
• Real and perceived bottlenecks
• Deluge of different 'omics data types
• Data Summarization: Examples in the experimental
  data space
• Taking the next step in text: Adding context
• Bringing it all together – what does it mean?
• Concluding remarks

VIB Pharma - Laboratory Data Management
Conference USA                            2
'omics Data Definition and Drug Discovery

• The English-language neologism “omics” informally refers
  to a field of study in biology ending in -omics, such as
  genomics or proteomics. The related suffix -ome is used
  to address the objects of study of such fields, such as the
  genome or proteome, respectively.
• Transcriptomics – the study of transcripts in a cell/tissue
• In a larger sense, data representing: Gene Expression,
  Gene Amplification/Deletion, Gene Mutations
• In the context of cell lines and tissues, these data are integrated
  to establish a picture of a disease and potential
  intervention points

Drug Target Identification & Validation
[Diagram: two data streams feeding the drug discovery path]

• Structured Data: Experimental Data (raw data available) → Analysis → Novel insights
       • Profiling Data: Gene Expression, Gene Copy Number, Gene Mutation
       • Functional Data: si/shRNA screens, Transfections, Knock-outs
• Semi- & Unstructured Data: Literature Data (raw data not or only partially available) → Extraction → Confirmatory insights
       • Target/Disease Associations
       • Specific experiment insight
       • Publication Density

Target Identification → Target Validation → Therapeutic Molecule
Key Issues in Oncology: highly diverse
data sets in need of integration
• Numerous datasets not connected to each other: Datasets located on hard drives, local machines, servers
• Datasets not “written in the same language”: Distinctly different annotations & data formats
• No unified global interface/portal: Cannot see what datasets exist internally or externally, cannot access
  datasets



[Figure: data types shown as separate islands with X marks between them: NextGen Sequence, Copy Number, TaqMan, siRNA, Drug Sensitivity, Karyotype, Images, Microarray, Cell Line Profiling. Used with permission from J. Argento]
Technology Advances contributing to the
'omics data deluge

• Cheaper “old-world” technologies
      – Microarrays: More refined, widely available, cheaper
      – Analysis software: Standardized processing, more
        commercial choices, more in public domain

• Newer high-throughput technologies
      – NextGen Sequencing – read lengths from 30 bp to 120 bp: Revolutionary
        method to gain insight into the genomic landscape of a sample
             • Transcriptomics (DGE, Digital Gene Expression): No pre-defined
               array necessary, very sensitive, detailed insight into a gene's
               transcriptome
             • Fusion Genes/Splice Variants: Detection of gene sequences that normally
               don't fit together
             • Alternative Transcripts: One gene, multiple versions
             • Genomics: Sequence variations among samples
        Source: http://seqanswers.com/
Academic “mega-Projects” contributing to
the 'omics data deluge
• The Cancer Genome Atlas Project (TCGA) – A Rich
  Catalog of Human Cancer Genomes
      – 25 cancer types; 500+ samples each; microarray gene
        expression and amplification/deletion data; traditional and
        NextGen sequencing data; epigenetic data; clinical data; data
        at 4 levels of data reduction

• The Sanger Cancer Genome Project – A rich
  characterization of cancer cell lines, tissues and
  responses
      – Catalog of somatic mutations in cancer cell lines and tissues;
        web tools, databases, downloads

• 1000 Genomes Project – A Deep Catalog of Human
  Genetic Variation
      – Sequence 1000+ human genomes to detect sequence
        variations in population segments
Data Deluge – TCGA Ovarian Cancer
example
[Screenshot taken from http://tcga-data.nci.nih.gov/tcga/]
• Data types: Transcriptome, Copy Number, Epigenetic Data, SNP, Sequences, Clinical
• Samples (tumor, normal, cell line, 11/500+)
• 1 Sample (25,000 data points, one kind of data)
• 3 levels of data:
      – Raw, unprocessed
      – Processed
      – Gene-level summaries

Where are the bottlenecks?
• Storage Space and Data Transfers: TB needed (already for microarray
  data), not just GB
      – Local storage: Attached to the analysis computer, fast connectivity, stores raw
        data files and processed files
      – Centralized storage: Remote storage for sharing data across sites; data
        transfer speeds are an issue

• Analytical Skills and Computing Power:
      – Computing power: e.g., 8 CPUs, 16 GB RAM, 1.5 days to process one
        large-scale copy number data set. How many CPUs can you put to work?
        Parallelizing the load on clusters does help
      – Human power: Still lots of data manipulation needed; domain knowledge is
        necessary

• Data sharing and integration capability:
      – Simplicity in user interface: Biologists are not computer scientists
      – Data cannot be shared as files: 1 gene copy number data set = 25 M rows of
        data
      – Data needs summarization to enable broad surveys
      – Query performance: The faster the better; make users aware of what they are
        asking for
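
The figures above lend themselves to a quick back-of-envelope check. The sketch below is only an illustration: the CPU count, run time and row count come from the slide, while the bytes-per-row figure and the function names are assumptions made for the example, and perfect parallel scaling is an optimistic bound rather than a measured result.

```python
# Back-of-envelope estimates for the copy number example on this slide.
# Values marked "assumed" are illustrative and not taken from the talk.

CPU_DAYS_PER_DATASET = 8 * 1.5   # from the slide: 8 CPUs busy for 1.5 days
ROWS_PER_DATASET = 25_000_000    # from the slide: 25 M rows per data set
BYTES_PER_ROW = 40               # assumed: ids, position and value stored as text


def ideal_wall_clock_days(n_datasets: int, n_cpus: int) -> float:
    """Wall-clock days if the load parallelized perfectly (optimistic bound)."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return n_datasets * CPU_DAYS_PER_DATASET / n_cpus


def flat_file_gigabytes(n_datasets: int) -> float:
    """Approximate size of the row-level data as flat files, in GB (10**9 bytes)."""
    return n_datasets * ROWS_PER_DATASET * BYTES_PER_ROW / 1e9


if __name__ == "__main__":
    # Ten data sets on a hypothetical 64-CPU cluster.
    print(ideal_wall_clock_days(10, 64), "days")
    print(flat_file_gigabytes(10), "GB")
```

Under these assumptions a single data set is already about a gigabyte of row-level text, which is why summarization rather than file sharing is the practical route to broad surveys.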
Who is affected by the bottlenecks?

[Arrow alongside the list: Increasing level of summarization]
• The Information Technologist
     – Cares about how much data needs to be stored where, how it
       will be transferred, how long it needs to be kept, what needs to
       be backed up, how big the database will get, etc.
• The Bioinformaticist
     – Cares about the raw data; needs powerful hardware, fast
       algorithms, plenty of storage space, means to share results
• The Scientist
     – Cares about analyzed results; wants immediate access and the ability
       to interact with the data and ask specific questions about a gene,
       group of genes or samples (tissues, cell lines)
• The Manager
     – Cares about the final result – a novel, interesting, validated
       target, ideally announced via automated e-mails

Workflow/Data for one (1) sample in
NextGen Sequencing
• Microarray: 50,000+ probe set result values per sample
• NextGen Sequencing: 100M+ reads per sample




                          depth of read coverage at each position
Junctions.bed:            possible splice junctions
Digital_expression.txt:   normalized read counts (RPKM) with coverage statistics
Abnormal_Junction.bed:    possible junctions caused by translocation events
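
The normalized read counts (RPKM) mentioned for Digital_expression.txt follow a standard definition: reads per kilobase of transcript per million mapped reads. A minimal sketch of that calculation is shown here; the function name and the example numbers are hypothetical and say nothing about how the pipeline behind this slide computed its values.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = read_count * 10**9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical gene: 500 reads on a 2,000 bp transcript, 100 M mapped reads.
    print(rpkm(500, 2_000, 100_000_000))
```

Dividing by both gene length and sequencing depth is what makes values comparable across genes and across samples with different read totals.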
Our appetite for data is enormous, but can
we digest it?

• Gene Expression: Microarrays, NextGen Sequencing
• Gene Copy Number Variations: Microarrays
• Gene Mutations: Classic and NextGen Sequencing
• Gene Methylations: Microarray
• Cell Line Panels Response Profiles: Plate-based
• siRNA screens: Gene library screens
• Phenotypic screens: Knock-out animals

  Or do we all need a lifetime supply of indigestion pills?


Is cloud computing a solution?

• Many Software-as-a-Service (SaaS) vendors
      –    Compendia Bioscience: Oncomine, Oncomine Power Tools
      –    NextBio: NextBio Basic, Professional, Enterprise
      –    GenomeQuest: NGS Data Management
      –    DNAnexus: NGS Data Management

• Keeping all the data that needs integration in one place
  makes life a lot easier
• No internal IT resources required
• However, companies subscribing to these services
  lose out on building an internal knowledge base
• Is this then a solution to the problem? Partially!
Oncomine Power Tools
[Screenshot; used with permission from Compendia Bioscience]
Data Reduction & Interactive
Visualizations – the key to success?

• Don't start in the weeds – take a step back: Create
  summarizations, e.g., summarize at the gene level to
  enable systematic surveys
• However, enable digging into the weeds: Select a
  gene and view the details – spread of sample values,
  sequence coverage of a gene's exons
• Make it interactive at every level: Search for gene
  lists, enable filtering by annotations (cellular location,
  target class, pathways, etc.)
• Clearly define what type of data you are dealing with:
  identity and annotations are critical
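
The three ideas above (summarize first, drill down on demand, filter by annotation) can be sketched in a few lines. Everything in this example is hypothetical: the probe rows, the annotation fields and the choice of the median as the gene-level summary are assumptions for illustration, not the method used in the tools shown in this talk.

```python
from collections import defaultdict
from statistics import median

# Hypothetical probe-level rows: (probe_id, gene_symbol, sample_id, log2_value)
PROBE_ROWS = [
    ("p1", "EGFR", "S1", 8.0),
    ("p2", "EGFR", "S1", 9.0),
    ("p3", "EGFR", "S1", 10.0),
    ("p1", "EGFR", "S2", 6.0),
    ("p2", "EGFR", "S2", 7.0),
    ("p4", "TP53", "S1", 5.0),
    ("p4", "TP53", "S2", 5.5),
]

# Hypothetical gene annotations used for filtering.
GENE_ANNOTATIONS = {
    "EGFR": {"target_class": "kinase", "location": "membrane"},
    "TP53": {"target_class": "transcription factor", "location": "nucleus"},
}


def summarize_by_gene(rows):
    """Step back: collapse probe-level values to one median per (gene, sample)."""
    grouped = defaultdict(list)
    for _probe, gene, sample, value in rows:
        grouped[(gene, sample)].append(value)
    return {key: median(values) for key, values in grouped.items()}


def filter_by_annotation(summary, annotations, field, wanted):
    """Keep only summary entries whose gene carries the wanted annotation value."""
    return {
        (gene, sample): value
        for (gene, sample), value in summary.items()
        if annotations.get(gene, {}).get(field) == wanted
    }


def probe_details(rows, gene):
    """Dig into the weeds: return the underlying probe rows for one gene."""
    return [row for row in rows if row[1] == gene]


if __name__ == "__main__":
    summary = summarize_by_gene(PROBE_ROWS)
    print(filter_by_annotation(summary, GENE_ANNOTATIONS, "target_class", "kinase"))
    print(probe_details(PROBE_ROWS, "TP53"))
```

Keeping the probe-level rows alongside the summary is what allows the drill-down; the summary alone would not support it.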

Internal Prototyping Workflow
[Workflow diagram]
• Operational Layer: Data Analysis → Data Mapping / Import → Profiling Data Warehouse
• Knowledge Layer: Information Links → Query & Visualize → …
Molecular Profiling Database – Gene
Expression Data
[Screenshot panels: Log2 difference; Summary Table; Details; Filters based on available sample & gene annotations]
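
The slide does not say how the “Log2 difference” column is derived. One common reading, assumed here purely for illustration, is the difference between the mean log2 expression of two sample groups (for example tumor versus normal). The function name and input values below are hypothetical.

```python
import math
from statistics import mean


def log2_difference(group_a, group_b):
    """Mean log2 value of group A minus mean log2 value of group B.

    Inputs are positive linear-scale expression values; a result of 1.0
    corresponds to a two-fold higher geometric mean in group A.
    """
    if not group_a or not group_b:
        raise ValueError("both groups need at least one value")
    return mean(math.log2(v) for v in group_a) - mean(math.log2(v) for v in group_b)


if __name__ == "__main__":
    # Hypothetical intensities for one gene in two sample groups.
    print(log2_difference([8.0, 8.0], [2.0, 2.0]))
```

Working on the log2 scale keeps up- and down-regulation symmetric around zero, which makes a summary table easier to sort and filter.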




Molecular Profiling Database – Gene
Expression Data
[Screenshot: Multiple Visualizations: Sample Spread (Scatter Plot); Sample Spread (Box Plot); Filters based on available sample & gene annotations]




High level visualization of gene mutations
in cell lines
[Screenshot: tissue and tumor categorizations shown against gene mutations in specific cell lines]




A Target Prioritization Tool
[Diagram]
• Input: List of Potential Targets (Target Classes)
• Input from Multiple Data Sources, each contributing Scores:
      – Gene Expression
      – Gene Copy Number
      – Gene Mutation
      – Tool Compounds
      – siRNA Functional
      – Knockout Functional
• Output: Prioritized Target List #1, Prioritized Target List #2
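
The diagram shows per-source scores feeding more than one prioritized list but does not state how scores are combined. A minimal sketch, assuming a simple weighted sum in which different weight profiles yield different lists, could look like this. The gene names, scores, weights and function name are all hypothetical.

```python
SOURCES = (
    "expression", "copy_number", "mutation",
    "tool_compounds", "sirna", "knockout",
)

# Hypothetical per-source scores in the range 0..1 for three candidate targets.
SCORES = {
    "GENE_A": {"expression": 0.9, "copy_number": 0.8, "mutation": 0.1,
               "tool_compounds": 0.0, "sirna": 0.2, "knockout": 0.1},
    "GENE_B": {"expression": 0.3, "copy_number": 0.2, "mutation": 0.4,
               "tool_compounds": 0.5, "sirna": 0.9, "knockout": 0.8},
    "GENE_C": {"expression": 0.5, "copy_number": 0.5, "mutation": 0.5,
               "tool_compounds": 0.2, "sirna": 0.4, "knockout": 0.3},
}

# Two assumed weight profiles: one favors profiling data, one functional data.
PROFILING_WEIGHTS = {"expression": 1.0, "copy_number": 1.0, "mutation": 1.0}
FUNCTIONAL_WEIGHTS = {"tool_compounds": 1.0, "sirna": 1.0, "knockout": 1.0}


def prioritize(scores, weights):
    """Rank targets by weighted sum of per-source scores, highest first."""
    totals = {
        gene: sum(weights.get(src, 0.0) * per_source.get(src, 0.0) for src in SOURCES)
        for gene, per_source in scores.items()
    }
    return sorted(totals, key=lambda gene: totals[gene], reverse=True)


if __name__ == "__main__":
    print("List #1:", prioritize(SCORES, PROFILING_WEIGHTS))
    print("List #2:", prioritize(SCORES, FUNCTIONAL_WEIGHTS))
```

Missing sources default to a score of zero here; whether absent data should count against a target or be left out of the sum is a design decision the slide leaves open.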

Concluding remarks/Acknowledgements

• We will get more data.
• If we fail to organize and summarize, we'll waste a lot
  of time and money.
• Some standards will be necessary; we'll have to
  compromise at times.
• Not everything can or will be in one place. Defined
  interfaces (data, processes) between disparate
  systems are desperately needed to enable data
  interchange.
• Acknowledgements: My colleagues in Research Informatics &
  Hematology/Oncology Therapeutic Area