Dmla0910 – Hoeck – Presentation

Transcript

  • 1. A foggy affair – sifting through the 'omics data flood
      Wolfgang Hoeck, Principal Business Analyst
      R&DI - Research & Development Informatics
      09/23/2010
  • 2. Today's Presentation
      – Terms and Domain Definitions
      – The "problem space" from a scientist's perspective
      – Real and perceived bottlenecks
      – Deluge of different 'omics data types
      – Data Summarization: Examples in the experimental data space
      – Taking the next step in text: Adding context
      – Bringing it all together: what does it mean?
      – Concluding remarks
      VIB Pharma - Laboratory Data Management Conference USA
  • 3. 'omics Data Definition and Drug Discovery
      – The English-language neologism "omics" informally refers to a field of study in biology ending in -omics, such as genomics or proteomics.
      – The related suffix -ome is used to address the objects of study of such fields, such as the genome or proteome, respectively.
      – Transcriptomics: the study of transcripts in a cell/tissue
      – In a larger sense, data representing Gene Expression, Gene Amplification/Deletion, and Gene Mutations
      – In the context of cell lines and tissues, these data are integrated to establish a picture of a disease and potential intervention points
  • 4. Drug Target Identification & Validation
      – Structured Data: Experimental Data (raw data available; Analysis)
          • Profiling Data: Gene Expression, Gene Copy Number, Gene Mutation
          • Functional Data: si/shRNA screens, Transfections, Knock-outs
          • Novel insights
      – Semi- & Unstructured Data: Literature Data (raw data not or only partially available; Extraction)
          • Target/Disease Associations
          • Specific experiment insight
          • Publication Density
          • Confirmatory insights
      – Stages: Target Identification, Target Validation, Therapeutic Molecule
  • 5. Key Issues
      In Oncology: highly diverse data sets in need of integration
      – Numerous datasets not connected to each other: datasets located on hard drives, local machines, servers
      – Datasets not "written in the same language": distinctly different annotations & data formats
      – No unified global interface/portal: cannot see what datasets exist internally or externally, cannot access datasets
      – Figure (disconnected dataset types): NextGen Sequence, Copy Number, TaqMan, siRNA, Drug Sensitivity, Karyotype, Images, Microarray, Cell Line Profiling. Used with permission from J. Argento
  • 6. Technology Advances contributing to the 'omics data deluge
      Cheaper "old-world" technologies
      – Microarrays: More refined, widely available, cheaper
      – Analysis software: Standardized processing, more commercial choices, more in public domain
      Newer high-throughput technologies
      – NextGen Sequencing, from 30 bp to 120 bp: Revolutionary method to gain insight into the genomics landscape of a sample
          • Transcriptomics (DGE, Digital Gene Expression): No pre-defined array necessary, very sensitive, detailed insight into a gene's transcriptome
          • Fusion Genes/Splice Variants: Detection of genes that normally don't fit together
          • Alternative Transcripts: One gene, multiple versions
          • Genomics: Sequence variations among samples
      http://seqanswers.com/
  • 7. Academic "mega-Projects" contributing to the 'omics data deluge
      The Cancer Genome Atlas Project (TCGA): A Rich Catalog of Human Cancer Genomes
      – 25 cancer types; 500+ samples each; microarray gene expression, amplification/deletion data; traditional and NextGen sequencing data; epigenetic data; clinical data; data at 4 levels of data reduction
      The Sanger Cancer Genome Project: A rich characterization of cancer cell lines, tissues and responses
      – Catalog of somatic mutations in cancer cell lines and tissues; web tools, databases, downloads
      1000 Genomes Project: A Deep Catalog of Human Genetic Variation
      – Sequence 1000+ human genomes to detect sequence variations in population segments
  • 8. Data Deluge – TCGA Ovarian Cancer example
      – Data types: Transcriptome, Clinical, Copy Number, Epigenetic Data, SNP, Sequences
      – 3 levels of data: raw (unprocessed), processed, gene-level summaries
      – Samples (tumor, normal, cell line, 11/500+)
      – 1 Sample (25000 data points, one kind of data)
      – Screenshot taken from http://tcga-data.nci.nih.gov/tcga/
  • 9. Where are the bottlenecks?
      Storage Space and Data Transfers: TB needed (already for microarray data), not just GB
      – Local storage: Attached to analysis computer, fast connectivity, stored raw data files and processed files
      – Centralized storage: Remote storage for sharing data across sites; data transfer speeds are an issue
      Analytical Skills and Computing Power
      – Computing power: e.g., 8 CPUs, 16 GB RAM, 1.5 days to process one large-scale copy number data set. How many CPUs can you put to work? Parallelizing the load on clusters does help
      – Human power: Still lots of data manipulation needed; domain knowledge is necessary
      Data sharing and integration capability
      – Simplicity in user interface: Biologists are not computer scientists
      – Data cannot be shared as files: 1 gene copy number data set = 25 M rows of data
      – Data needs summarization to enable broad surveys
      – Query performance: The faster the better; make users aware of what they are asking for
  • 10. Who is affected by the bottlenecks?
      Increasing Level of summarization
      – The Information Technologist: Cares about how much data needs to be stored where, how it will be transferred, how long it needs to be kept, what needs to be backed up, how big the database will get, etc.
      – The Bioinformaticist: Cares about the raw data, needs powerful hardware, fast algorithms, plenty of storage space, means to share results
      – The Scientist: Cares about analyzed results, wants immediate access, ability to interact with the data, ask specific questions about a gene, group of genes or samples (tissues, cell lines)
      – The Manager: Cares about the final result: a novel, interesting, validated target, ideally notified via automated e-mails
  • 11. Workflow/Data for one (1) sample in NextGen Sequencing
      – Microarray: 50,000+ probe set result values per sample
      – NextGen Sequencing: 100M+ reads per sample
          • depth of read coverage at each position
          • Junctions.bed: possible splice junctions
          • Digital_expression.txt: normalized read counts (RPKM) with coverage statistics
          • Abnormal_Junction.bed: possible junctions caused by translocation events
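Slide 11 describes Digital_expression.txt as holding normalized read counts (RPKM). As a minimal sketch of what that normalization means, the function below applies the standard RPKM definition (reads per kilobase of transcript per million mapped reads). The function name, argument names, and example numbers are illustrative assumptions, not taken from the presenter's pipeline.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    All names here are hypothetical; only the formula is standard.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp gene in a 100M-read sample.
example_rpkm = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=100_000_000)
```

Dividing by gene length and by sequencing depth is what makes values comparable across genes and across samples, which is the precondition for the gene-level surveys discussed later in the talk.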
  • 12. Our appetite for data is enormous, but can we digest it?
      – Gene Expression: Microarrays, NextGen Sequencing
      – Gene Copy Number Variations: Microarrays
      – Gene Mutations: Classic and NextGen Sequencing
      – Gene Methylations: Microarray
      – Cell Line Panels Response Profiles: Plate-based
      – siRNA screens: gene library screens
      – Phenotypic screens: Knock-out animals
      Or do we all need a lifetime supply of indigestion pills?
  • 13. Is cloud computing a solution?
      Many Software-as-a-Service (SaaS) vendors
      – Compendia Bioscience: Oncomine, Oncomine Power Tools
      – NextBio: NextBio Basic, Professional, Enterprise
      – GenomeQuest: NGS Data Management
      – DNAnexus: NGS Data Management
      Keeping all the data that need integration in one place makes life a lot easier
      No internal IT resources required
      However, companies subscribing to these services lose out on building an internal knowledgebase
      Is this then a solution to this problem? Partially!!
  • 14. Oncomine Power Tools
      Used with permission from Compendia Bioscience
  • 15. Data Reduction & Interactive Visualizations – the key to success?
      – Don't start in the weeds, take a step back: Create summarizations; e.g., summarize at Gene level to enable systematic surveys
      – However, enable digging into the weeds: Select a Gene and view the details (spread of sample values, sequence coverage of a gene's exon)
      – Make it interactive at every level: Search for Gene lists, enable filtering by annotations (cellular location, target class, pathways, etc.)
      – Clearly define what type of data you are dealing with: identity and annotations are critical
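Slide 15 recommends summarizing at gene level for surveys while still allowing a drill-down to per-sample details. The sketch below shows one way that could look, assuming the measurements arrive as (gene, sample, value) tuples. The record layout, function names, and the choice of min/median/max as summary statistics are assumptions for illustration, not the presenter's implementation.

```python
from collections import defaultdict
from statistics import median


def summarize_by_gene(rows):
    """Collapse (gene, sample, value) rows into one summary record per gene."""
    values = defaultdict(list)
    for gene, _sample, value in rows:
        values[gene].append(value)
    return {
        gene: {"n": len(v), "min": min(v), "median": median(v), "max": max(v)}
        for gene, v in values.items()
    }


def details_for_gene(rows, gene):
    """Drill down: return the per-sample values behind one gene's summary."""
    return {sample: value for g, sample, value in rows if g == gene}


# Hypothetical measurements; gene symbols and sample names are placeholders.
rows = [
    ("EGFR", "cell_line_A", 1.0),
    ("EGFR", "cell_line_B", 3.0),
    ("EGFR", "cell_line_C", 2.0),
    ("MYC", "cell_line_A", 5.0),
]
summary = summarize_by_gene(rows)
egfr_details = details_for_gene(rows, "EGFR")
```

The summary table stays small (one row per gene) regardless of how many samples are loaded, while the detail view is computed only for the gene a user selects.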
  • 16. Internal Prototyping Workflow
      Diagram labels: Operational Layer; Knowledge Layer; Query & Visualize; Data Mapping; Import; Information Links; Profiling Data Warehouse; Data Analysis; …
  • 17. Molecular Profiling Database – Gene Expression Data
      – Filters based on available sample & gene annotations
      – Log2 difference
      – Summary Table
      – Details
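Slide 17 labels one of its views "Log2 difference" without defining it. One common reading, assumed here, is the difference between the mean log2 expression of two sample groups. The sketch below computes that quantity from linear-scale intensities; the function name, grouping, and numbers are hypothetical.

```python
import math
from statistics import mean


def log2_difference(group_a, group_b):
    """Mean log2 value of group_a minus mean log2 value of group_b.

    Inputs are linear-scale, strictly positive expression values.
    This definition is an assumption; the slide does not spell it out.
    """
    log_a = [math.log2(x) for x in group_a]
    log_b = [math.log2(x) for x in group_b]
    return mean(log_a) - mean(log_b)


# Hypothetical intensities for one gene in two sample groups.
diff = log2_difference(group_a=[8.0, 8.0], group_b=[2.0, 2.0])
```

A log2 difference of 1 corresponds to a two-fold change, which is why the log scale is convenient for filtering and sorting genes in a summary table.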
  • 18. Molecular Profiling Database – Gene Expression Data
      Multiple Visualizations
      – Filters based on available sample & gene annotations
      – Sample Spread (Scatter Plot)
      – Sample Spread (Box Plot)
  • 19. High level visualization of gene mutations in cell lines
      – Tissue and Tumor categorizations
      – Gene mutations in specific cell lines
  • 20. A Target Prioritization Tool
      – Input from Multiple Data Sources: List of Potential Targets (Target Classes)
      – Scores per data source: Gene Expression Scores, Gene Copy Number Scores, Gene Mutation Scores, Tool Compounds Scores, siRNA Functional Scores, Knockout Functional Scores
      – Output: Prioritized Target List #1, Prioritized Target List #2
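Slide 20 shows per-source scores feeding into more than one prioritized target list, but it does not say how the scores are combined. The sketch below assumes a simple weighted sum, with a different weight set producing each list. The weighting scheme, target names, and score values are all hypothetical.

```python
def prioritize(targets, weights):
    """Rank target names by a weighted sum of their per-source scores.

    targets: {target_name: {source_name: score}}
    weights: {source_name: weight}; sources missing from weights count as 0.
    The weighted sum is an illustrative assumption, not the tool's actual method.
    """
    def total(scores):
        return sum(weights.get(source, 0.0) * score for source, score in scores.items())

    return sorted(targets, key=lambda name: total(targets[name]), reverse=True)


# Hypothetical targets with scores from two of the data sources on the slide.
targets = {
    "GENE_A": {"gene_expression": 0.9, "sirna_functional": 0.2},
    "GENE_B": {"gene_expression": 0.4, "sirna_functional": 0.8},
}

# Two weightings give two differently ordered lists.
list_1 = prioritize(targets, {"gene_expression": 1.0, "sirna_functional": 0.5})
list_2 = prioritize(targets, {"gene_expression": 0.5, "sirna_functional": 1.0})
```

Keeping the weights separate from the scores makes it cheap to regenerate a list when a group wants to emphasize functional evidence over profiling evidence, or the reverse.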
  • 21. Concluding remarks/Acknowledgements
      – We will get more data. If we fail to organize and summarize, we'll waste a lot of time and money
      – Some standards will be necessary; we'll have to compromise at times
      – Not everything can and will be in one place. Defined interfaces (data, processes) between disparate systems are desperately needed to enable data interchange.
      – Acknowledgements: My colleagues in Research Informatics & Hematology/Oncology Therapeutic Area
