Dmla0910 – Hoeck – Presentation



  1. A foggy affair – sifting through the 'omics data flood
     Wolfgang Hoeck, Principal Business Analyst, 09/23/2010
     R&DI - Research & Development Informatics
  2. Today's Presentation
     • Terms and Domain Definitions
     • The "problem space" from a scientist's perspective
     • Real and perceived bottlenecks
     • Deluge of different 'omics data types
     • Data Summarization: Examples in the experimental data space
     • Taking the next step in text: Adding context
     • Bringing it all together: what does it mean?
     • Concluding remarks
     VIB Pharma - Laboratory Data Management Conference USA
  3. 'omics Data Definition and Drug Discovery
     • The English-language neologism "omics" informally refers to a field of study in biology ending in -omics, such as genomics or proteomics.
     • The related suffix -ome is used to address the objects of study of such fields, such as the genome or proteome, respectively.
     • Transcriptomics: the study of transcripts in a cell or tissue
     • In a larger sense, data representing: Gene Expression, Gene Amplification/Deletion, Gene Mutations
     • In the context of cell lines and tissues, integrated together to establish a picture of a disease and potential intervention points
  4. Drug Target Identification & Validation
     • Structured Data (Experimental Data): raw data available; handled by Analysis
       - Profiling Data: Gene Expression, Gene Copy Number, Gene Mutation
       - Functional Data: si/shRNA screens, Transfections, Knock-outs
     • Semi- & Unstructured Data (Literature Data): raw data not or only partially available; handled by Extraction
       - Target/Disease Associations
       - Specific experiment insight
       - Publication Density
     • Outcomes: Novel insights, Confirmatory insights
     • Stages: Target Identification, Target Validation, Therapeutic Molecule
  5. Key Issues
     • In Oncology: highly diverse data sets in need of integration
     • Numerous datasets not connected to each other: datasets located on hard drives, local machines, servers
     • Datasets not "written in the same language": distinctly different annotations & data formats
     • No unified global interface/portal: cannot see what datasets exist internally or externally, cannot access datasets
     • [Figure: Cell Line Profiling data types: NextGen Sequence, Copy Number, TaqMan, siRNA, Drug Sensitivity, Karyotype, Images, Microarray. Used with permission from J. Argento]
  6. Technology Advances contributing to the 'omics data deluge
     • Cheaper "old-world" technologies
       - Microarrays: more refined, widely available, cheaper
       - Analysis software: standardized processing, more commercial choices, more in public domain
     • Newer high-throughput technologies
       - NextGen Sequencing, from 30 bp to 120 bp: revolutionary method to gain insight into the genomics landscape of a sample
         • Transcriptomics (DGE, Digital Gene Expression): no pre-defined array necessary, very sensitive, detailed insight into a gene's transcriptome
         • Fusion Genes/Splice Variants: detection of genes that normally don't fit together
         • Alternative Transcripts: one gene, multiple versions
         • Genomics: sequence variations among samples
  7. Academic "mega-Projects" contributing to the 'omics data deluge
     • The Cancer Genome Atlas Project (TCGA): A Rich Catalog of Human Cancer Genomes
       - 25 cancer types; 500+ samples each; microarray gene expression and amplification/deletion data; traditional and NextGen sequencing data; epigenetic data; clinical data; data at 4 levels of data reduction
     • The Sanger Cancer Genome Project: A rich characterization of cancer cell lines, tissues and responses
       - Catalog of somatic mutations in cancer cell lines and tissues; web tools, databases, downloads
     • 1000 Genomes Project: A Deep Catalog of Human Genetic Variation
       - Sequence 1000+ human genomes to detect sequence variations in population segments
  8. Data Deluge – TCGA Ovarian Cancer example
     • Data types: Transcriptome, Clinical, Copy Number, Epigenetic Data, SNP, Sequences
     • 3 levels of data: raw, unprocessed; processed; gene-level summaries
     • Samples (tumor, normal, cell line, 11/500+)
     • [Screenshot taken from 1 sample (25000 data points, one kind of data)]
  9. Where are the bottlenecks?
     • Storage Space and Data Transfers: TB needed (already for microarray data), not just GB
       - Local storage: attached to the analysis computer, fast connectivity, holds raw data files and processed files
       - Centralized storage: remote storage for sharing data across sites; data transfer speeds are an issue
     • Analytical Skills and Computing Power
       - Computing power: e.g., 8 CPUs, 16 GB RAM, 1.5 days to process one large-scale copy number data set. How many CPUs can you put to work? Parallelizing the load on clusters does help
       - Human power: still lots of data manipulation needed; domain knowledge is necessary
     • Data sharing and integration capability
       - Simplicity in user interface: biologists are not computer scientists
       - Data cannot be shared as files: 1 gene copy number data set = 25 M rows of data
       - Data needs summarization to enable broad surveys
       - Query performance: the faster the better; make users aware of what they are asking for
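The figures quoted on this slide lend themselves to a quick back-of-the-envelope estimate. The sketch below is illustrative only: the 25 M rows and the 1.5 days per data set come from the slide, while the bytes-per-row value and both function names are assumptions made here, and the run-time estimate assumes that data sets are independent and that each machine processes whole data sets one after another.

```python
import math

# Figures quoted on the slide.
ROWS_PER_COPY_NUMBER_SET = 25_000_000  # "25 M rows of data"
DAYS_PER_DATASET = 1.5                 # one data set on one 8-CPU, 16 GB RAM machine

# Assumption (not from the slide): average size of one row on disk.
ASSUMED_BYTES_PER_ROW = 40


def dataset_size_gb(rows, bytes_per_row=ASSUMED_BYTES_PER_ROW):
    """Approximate on-disk size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return rows * bytes_per_row / 1e9


def wall_clock_days(n_datasets, machines, days_per_dataset=DAYS_PER_DATASET):
    """Elapsed days when whole data sets are spread over identical machines.

    The busiest machine handles ceil(n_datasets / machines) data sets,
    and that machine determines the total elapsed time.
    """
    if machines < 1:
        raise ValueError("machines must be at least 1")
    return math.ceil(n_datasets / machines) * days_per_dataset


print(dataset_size_gb(ROWS_PER_COPY_NUMBER_SET))
print(wall_clock_days(n_datasets=10, machines=1))
print(wall_clock_days(n_datasets=10, machines=4))
```

Under these assumptions, ten data sets take 15 days on one machine and 4.5 days on four, which is the sense in which parallelizing on a cluster helps. Splitting a single data set across machines is a different problem and is not modeled here.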
  10. Who is affected by the bottlenecks?
      • (Roles listed in order of increasing level of summarization)
      • The Information Technologist: cares about how much data needs to be stored where, how it will be transferred, how long it needs to be kept, what needs to be backed up, how big the database will get, etc.
      • The Bioinformaticist: cares about the raw data; needs powerful hardware, fast algorithms, plenty of storage space, means to share results
      • The Scientist: cares about analyzed results; wants immediate access, the ability to interact with the data, and to ask specific questions about a gene, a group of genes, or samples (tissues, cell lines)
      • The Manager: cares about the final result, a novel, interesting, validated target, ideally notified via automated e-mails
  11. Workflow/Data for one (1) sample in NextGen Sequencing
      • Microarray: 50,000+ probe set result values per sample
      • NextGen Sequencing: 100M+ reads per sample
      • Depth of read coverage at each position
      • Junctions.bed: possible splice junctions
      • Digital_expression.txt: normalized read counts (RPKM) with coverage statistics
      • Abnormal_Junction.bed: possible junctions caused by translocation events
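For readers unfamiliar with the RPKM unit mentioned for Digital_expression.txt, here is a minimal sketch of the standard definition: reads per kilobase of transcript per million mapped reads, i.e. read count × 10^9 / (gene length in bp × total mapped reads). The function name and the example numbers are illustrative assumptions; the slide does not show how the presenter's pipeline computes or stores these values.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    Dividing by gene length (in kb) and by sequencing depth (in millions
    of reads) makes counts comparable across genes and across samples.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp transcript in a sample
# with 100 million mapped reads (the order of magnitude on the slide).
example = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=100_000_000)
print(example)
```

With these made-up inputs the result is 2.5 RPKM: 500 reads over 2 kb is 250 reads per kilobase, divided by 100 million-read units of depth.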
  12. Our appetite for data is enormous, but can we digest it?
      • Gene Expression: Microarrays, NextGen Sequencing
      • Gene Copy Number Variations: Microarrays
      • Gene Mutations: Classic and NextGen Sequencing
      • Gene Methylations: Microarray
      • Cell Line Panels Response Profiles: Plate-based
      • siRNA screens: gene library screens
      • Phenotypic screens: Knock-out animals
      • Or do we all need a lifetime supply of indigestion pills?
  13. Is cloud computing a solution?
      • Many Software-as-a-Service (SaaS) vendors
        - Compendia Bioscience: Oncomine, Oncomine Power Tools
        - NextBio: NextBio Basic, Professional, Enterprise
        - GenomeQuest: NGS Data Management
        - DNAnexus: NGS Data Management
      • Keeping all the data that need integration in one place makes life a lot easier
      • No internal IT resources required
      • However, companies subscribing to these services lose out on building an internal knowledgebase
      • Is this then a solution to this problem? Partially!
  14. Oncomine Power Tools
      • [Screenshot. Used with permission from Compendia Bioscience]
  15. Data Reduction & Interactive Visualizations – the key to success?
      • Don't start in the weeds, take a step back: create summarizations, e.g., summarize at gene level to enable systematic surveys
      • However, enable digging into the weeds: select a gene and view the details (spread of sample values, sequence coverage of a gene's exon)
      • Make it interactive at every level: search for gene lists, enable filtering by annotations (cellular location, target class, pathways, etc.)
      • Clearly define what type of data you are dealing with: identity and annotations are critical
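The summarize-first, drill-down-later idea on this slide can be sketched in a few lines. Everything below is a hypothetical illustration: the measurements, the annotation table, and the function names are invented for the example and are not taken from the presenter's system.

```python
from collections import defaultdict
from statistics import median

# Hypothetical long-format measurements: (gene, sample, value).
MEASUREMENTS = [
    ("EGFR", "S1", 8.0),
    ("EGFR", "S2", 10.0),
    ("EGFR", "S3", 9.0),
    ("KRAS", "S1", 5.0),
    ("KRAS", "S2", 7.0),
]

# Hypothetical annotation table used for filtering.
ANNOTATIONS = {
    "EGFR": {"target_class": "kinase"},
    "KRAS": {"target_class": "GTPase"},
}


def group_by_gene(measurements):
    """Detail view: gene -> {sample: value}, kept for drill-down."""
    by_gene = defaultdict(dict)
    for gene, sample, value in measurements:
        by_gene[gene][sample] = value
    return by_gene


def summarize(by_gene):
    """Survey view: one row of summary statistics per gene."""
    return {
        gene: {
            "n": len(samples),
            "min": min(samples.values()),
            "median": median(samples.values()),
            "max": max(samples.values()),
        }
        for gene, samples in by_gene.items()
    }


def filter_genes(summary, annotations, **criteria):
    """Keep only genes whose annotations match every given criterion."""
    return {
        gene: stats
        for gene, stats in summary.items()
        if all(annotations.get(gene, {}).get(k) == v for k, v in criteria.items())
    }


BY_GENE = group_by_gene(MEASUREMENTS)
SUMMARY = summarize(BY_GENE)
KINASES = filter_genes(SUMMARY, ANNOTATIONS, target_class="kinase")

print(SUMMARY)          # broad survey, one entry per gene
print(KINASES)          # survey restricted by annotation
print(BY_GENE["EGFR"])  # drill-down to per-sample values for one gene
```

The point of keeping both `SUMMARY` and `BY_GENE` is that the survey stays small enough to browse while the per-sample detail remains one lookup away.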
  16. Internal Prototyping Workflow
      • [Diagram: Operational Layer and Knowledge Layer; elements shown: Query & Visualize, Data Mapping, Import, Information Links, Profiling Data Warehouse, Data Analysis, …]
  17. Molecular Profiling Database – Gene Expression Data
      • [Screenshot: filters based on available sample & gene annotations; Log2 difference; Summary Table; Details]
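The slide labels one column "Log2 difference" without defining it. A common reading, assumed here and not confirmed by the source, is the difference between the mean log2 expression of two sample groups, which is equivalent to a log2 fold change. The function name and the intensity values are made up for illustration.

```python
import math
from statistics import mean


def log2_difference(group_a, group_b):
    """Mean log2 value of group_a minus mean log2 value of group_b.

    Inputs are raw (unlogged) positive intensities. A result of 1.0
    means group_a is, on the log scale, two-fold higher than group_b.
    """
    if not group_a or not group_b:
        raise ValueError("both groups need at least one value")
    if min(group_a) <= 0 or min(group_b) <= 0:
        raise ValueError("intensities must be positive to take log2")
    log_a = [math.log2(v) for v in group_a]
    log_b = [math.log2(v) for v in group_b]
    return mean(log_a) - mean(log_b)


# Hypothetical intensities for one gene in two sample groups.
tumor = [16.0, 64.0]
normal = [4.0, 4.0]
print(log2_difference(tumor, normal))
```

For these values the mean log2 intensities are 5 and 2, so the difference is 3.0, i.e. an eight-fold change on the original scale.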
  18. Molecular Profiling Database – Gene Expression Data
      • Multiple Visualizations
      • [Screenshot: filters based on available sample & gene annotations; Sample Spread (Scatter Plot); Sample Spread (Box Plot)]
  19. High level visualization of gene mutations in cell lines
      • [Figure panels: Tissue and Tumor categorizations; Gene mutations in specific cell lines]
  20. A Target Prioritization Tool
      • Input from Multiple Data Sources
      • List of Potential Targets (Target Classes)
      • Score columns: Gene Expression Scores, Gene Copy Number Scores, Gene Mutation Scores, Tool Compounds Scores, siRNA Functional Scores, Knockout Functional Scores
      • Output: Prioritized Target List #1, Prioritized Target List #2
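One simple way to turn per-source scores into a ranked list is a weighted sum. The sketch below is an assumption about how such a tool could work, not a description of the presenter's tool: the targets, the scores, the weights, and the idea that the two prioritized lists come from two different weightings are all hypothetical.

```python
# Score columns named on the slide.
SOURCES = ("expression", "copy_number", "mutation",
           "tool_compounds", "sirna", "knockout")

# Hypothetical per-source scores for three placeholder targets.
SCORES = {
    "TARGET_A": {"expression": 3, "copy_number": 1, "mutation": 0,
                 "tool_compounds": 1, "sirna": 2, "knockout": 0},
    "TARGET_B": {"expression": 1, "copy_number": 0, "mutation": 3,
                 "tool_compounds": 0, "sirna": 3, "knockout": 2},
    "TARGET_C": {"expression": 2, "copy_number": 2, "mutation": 1,
                 "tool_compounds": 0, "sirna": 0, "knockout": 0},
}

# Two hypothetical weightings, giving two different prioritized lists.
EXPRESSION_WEIGHTED = {"expression": 3, "copy_number": 1, "mutation": 1,
                       "tool_compounds": 1, "sirna": 1, "knockout": 1}
FUNCTIONAL_WEIGHTED = {"expression": 1, "copy_number": 1, "mutation": 1,
                       "tool_compounds": 1, "sirna": 3, "knockout": 3}


def total_score(target_scores, weights):
    """Weighted sum over all sources; missing entries count as zero."""
    return sum(weights.get(s, 0) * target_scores.get(s, 0) for s in SOURCES)


def prioritize(scores, weights):
    """Targets sorted by descending total score, ties broken by name."""
    return sorted(scores, key=lambda t: (-total_score(scores[t], weights), t))


LIST_1 = prioritize(SCORES, EXPRESSION_WEIGHTED)
LIST_2 = prioritize(SCORES, FUNCTIONAL_WEIGHTED)
print(LIST_1)
print(LIST_2)
```

With these made-up numbers the expression-weighted list puts TARGET_A first, while the functional weighting promotes TARGET_B, which shows how the choice of weights, rather than the data alone, shapes the ranking.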
  21. Concluding remarks/Acknowledgements
      • We will get more data. If we fail to organize and summarize, we'll waste a lot of time and money
      • Some standards will be necessary; we'll have to compromise at times
      • Not everything can and will be in one place. Defined interfaces (data, processes) between disparate systems are desperately needed to enable data interchange.
      • Acknowledgements: my colleagues in Research Informatics & Hematology/Oncology Therapeutic Area