8. Henrik Seidel, Bayer Healthcare

Transcript

  • 1. Setting up RNA-seq in a pharma company - an experience report
      Eagle Genomics 3rd Symposium
      Henrik Seidel, 2013-03-21
      Bayer Healthcare – Global Drug Discovery – Target Discovery Technologies
      Page 1 • Eagle’s 3rd Symposium • Henrik Seidel • Bayer Healthcare • March 2013
  • 2. Setting the Scene: Bioinformatics at Bayer Healthcare
      • The number of bioinformatics scientists at BHC has been quite stable over recent years (5 permanent scientists plus one temporary postdoc, supporting oncology, gynecology, cardiovascular research, and drug repositioning).
      • The tasks have changed over time. The importance and impact of integrating data from different sources has increased dramatically and has become a key to successful research.
      • With NGS came a number of novel challenges related to high performance computing, data storage, and algorithms, which require new ways of thinking and new types of solutions. For pharma, this also includes legal aspects, because we work with patient data.
      • Oncogenomics and GWAS are new areas in high demand from therapeutic research groups, and they stress our small team beyond its capacity limits.
  • 3. History of This Talk
      What I told Richard Holland last year when he invited me to talk at this symposium:
      • We have limited experience with NGS data analysis so far. I would not feel comfortable giving a talk to people who probably have much more experience than we have in dealing with NGS analysis in a pharma company.
      • I wouldn’t be able to talk about solutions. It would be a description of the challenges and tasks we face when introducing new technologies such as NGS or HPC in a pharma context, and of where we are on our way right now.
      What Richard answered:
      • That’s exactly what we are looking for!
  • 4. RNA-Seq – The Hype Curve and How We Saw It
      What we hoped for when we were here:
      • RNA-Seq is just counting
      • No dependency on microarray generations
      • No cross hybridization issues
      • Easy to compare across studies and labs
      • Can compare absolute expression levels of different genes
      • Can identify novel transcript variants
      • Can resolve transcript variant expression levels
      • Can disentangle transcripts from different species
      • Provides information on sequence modifications
      • Can identify fusion transcripts
      • Sensitivity can be simply scaled up by increasing number of reads per sample
  • 5. RNA-Seq – The Hype Curve and How We See It
      Would have been lovely, but:
      • Long turnaround times compared to microarrays
      • Strong dependency on library preparation protocol
      • Library generation is a sensitive procedure, and consistent quality of libraries is challenging
      • Ensuring consistent cluster density on Illumina sequencers is challenging (→ sequence quality, reads per sample)
      • Determining expression levels of transcript variants is not really solved and may even require full length transcript sequencing
      • RNA-seq has several biases as well (examples: lower variability for long transcripts; alleles deviating from the reference genome)
  • 6. First Decisions …
      Do we want to migrate from microarrays to RNA-Seq? Yes, because:
      • Despite the challenges, and even after re-adjusting some excessive expectations, RNA-seq has many advantages
      • RNA-seq is becoming the standard technology for whole-transcriptome expression analysis
      Do we want to establish a high-throughput sequencer in-house? No, because:
      • Technologies are developing very quickly → devices are obsolete shortly after, or in the worst case even before, they are fully established in-house and the capital expenditure is depreciated
      • Establishing such a complex device in-house requires significant resources
      • This may change given the experiences we have made with NGS service providers so far.
  • 7. The ideal world of RNA-seq – what would it look like?
      Ideal RNA-Seq World:
      • No amplification required
      • Low costs
      • Single molecule sequencing
      • No bias by sequence
      • Full length reads
      • No bias by sample prep
      • Very fast
      • No errors
  • 8. RNA-Seq Data Analysis – Overview
      Analysis Pipelines, by provider:
      • Public (examples: Galaxy, KNIME)
      • Commercial (examples: Genedata Expressionist, Partek Genomics Suite, Genomatix)
      • In-House (example: Postdoc’s Pipeline)
      Tools for Individual Steps of Analysis, by provider:
      • Public (examples: academic groups, Broad, Sanger, EBI)
      • Commercial (examples: Partek Genomics Suite, Genomatix)
      • In-House (example: Postdoc’s Fusion Detection)
      Analysis steps: Quality Control, Read Alignment, Read Assembly, Transcript Mapping, Transcript Discovery, Fusion Detection, Statistical Analysis, Genome Browser, Annotation Databases
      Location: On-site, Cloud, Service Provider
  • 9. Tools for Individual Steps of Analysis
      Challenges for a small bioinformatics group:
      ► Rapid development of new and improved algorithms is ongoing → staying up-to-date is already a challenge:
          • Which tools exist for each type of analysis?
          • How do they compare?
          • Which combinations of tools are useful?
      ► Most algorithms are developed by the academic community
          • Local installation and maintenance of many diverse individual tools required
          • Interoperability of tools not always given
          • Long-term maintenance of tools unclear
          • Most tools are command line tools or R packages, which are not suitable for lab scientists
  • 10. Analysis Pipelines
      ► Public Pipeline Tools
          • Several public open source tools exist for creating analysis pipelines (e.g., Galaxy, KNIME)
          • Pros: No license costs; High flexibility
          • Cons: Patchwork of tools; High maintenance effort; Not optimized for performance
      ► Commercial Pipeline Tools
          • Analysis systems providing and connecting all required analysis tools (e.g., Genedata Refiner Genome / Analyst, Partek Genomics Suite)
          • Pros: Seamless integration of analysis tools; Optimized for performance; Maintenance by vendor; Uniform interface suitable for lab scientists
          • Cons: Additional delay for adding new algorithms; Less flexibility; License costs
  • 11. Analysis Pipeline - Decision
      Dual analysis system strategy:
      • Proprietary: In-house data analysis pipeline established in collaboration with Max Planck Institute for Molecular Genetics (high flexibility, but for experts only)
      • Commercial: Genedata Refiner Genome / Analyst was tested and is now licensed and established as standard analysis and visualization tool (less flexibility, but not just for experts)
      Advantage of dual strategy:
      • Commercial analysis system covers the majority of NGS data analysis types and relieves our small bioinformatics group from having to maintain an up-to-date infrastructure. Bioinformatics scientists can focus on data analysis and biological interpretation.
      • Proprietary in-house data pipeline can be enhanced and extended for more sophisticated types of analyses as time permits and as needed
  • 12. RNA-Seq Data
      ► Proprietary Data
          • Transition from microarrays to RNA-seq for expression analysis will result in about 1000 samples with in-house RNA-seq data per year
      ► Public Data
          • Public initiatives have generated and will continue to generate large volumes of RNA-seq data
          • It is currently unclear which data will be available at level 1 (read sequences).
          • Level 1 data is required for application of improved analysis tools (e.g., better detection of new transcript variants or gene fusions)
          • Level 1 data usually requires registration of Principal Investigators and analyses at, e.g., NCBI.
      Tumor RNA-Seq samples per cancer study:
           774  Breast Invasive Carcinoma (TCGA)
           419  Kidney Renal Clear Cell Carcinoma (TCGA, Provisional)
           333  Uterine Corpus Endometrioid Carcinoma (TCGA, Provisional)
           263  Head and Neck Squamous Cell Carcinoma (TCGA, Provisional)
           244  Colon and Rectum Adenocarcinoma (TCGA, Nature 2012)
           178  Lung Squamous Cell Carcinoma (TCGA, Nature 2012)
            92  Lung Adenocarcinoma (TCGA, Provisional)
            56  Bladder Urothelial Carcinoma (TCGA, Provisional)
            17  Liver Hepatocellular Carcinoma (TCGA, Provisional)
            14  Kidney Renal Papillary Cell Carcinoma (TCGA, Provisional)
             3  Thyroid Carcinoma (TCGA, Provisional)
          2393  Total
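As a quick check of the arithmetic on this slide, the per-study counts can be summed and set against the expected in-house volume. This is only an illustrative sketch: the counts and the figure of about 1000 in-house samples per year come from the slide, while the variable names are made up for the example.

```python
# Tumor RNA-seq sample counts per TCGA study, copied from the slide.
tcga_rnaseq_samples = {
    "Breast Invasive Carcinoma": 774,
    "Kidney Renal Clear Cell Carcinoma": 419,
    "Uterine Corpus Endometrioid Carcinoma": 333,
    "Head and Neck Squamous Cell Carcinoma": 263,
    "Colon and Rectum Adenocarcinoma": 244,
    "Lung Squamous Cell Carcinoma": 178,
    "Lung Adenocarcinoma": 92,
    "Bladder Urothelial Carcinoma": 56,
    "Liver Hepatocellular Carcinoma": 17,
    "Kidney Renal Papillary Cell Carcinoma": 14,
    "Thyroid Carcinoma": 3,
}

# "About 1000 samples ... per year" is the slide's own estimate.
in_house_samples_per_year = 1000

public_total = sum(tcga_rnaseq_samples.values())
years_equivalent = public_total / in_house_samples_per_year

print(f"Public tumor RNA-seq samples listed: {public_total}")
print(f"Equivalent years of in-house output: {years_equivalent:.1f}")
```

The listed public data corresponds to a bit more than two years of the projected in-house output.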
  • 13. Data Storage
      ► Storage of RNA-seq raw data:
          • Disk space is highly needed, but using standard central IT services is too expensive
          • Possible alternatives:
              • Storage of data in the cloud
                  Pros: No hardware maintenance required; Scalable; Allows analysis of combined proprietary and public data
                  Cons: All analysis programs must run in the cloud; Data security delegated to cloud provider; Some restrictions due to data privacy acts
              • Dedicated non-standard in-house platform
                  Pros: Easier to comply with data privacy requirements; Analysis software does not need to be ready for cloud
                  Cons: Non-standard hardware; Up-scaling requires new investments; No direct access to public data that is already in the cloud
  • 14. Data Offline Storage
      ► Q: Which data needs to be on-line for data analysis?
          • Data that shall be re-analyzed with improved analysis algorithms
          • Data that shall be joined with additional new public or proprietary data sets for an analysis of increased statistical power
          • Example: Identifying tumor subtypes might require study sizes that are not achieved by single studies
      ► Q: Which data can be offline?
          • Level-1 sequence data one year after initial analysis
          • Do not move to offline storage if data re-analysis or meta-study analysis within next 6 months is likely
      ► Q: How should offline data be stored?
          • Keep original disks obtained from sequencing providers
          • Other options: Amazon Glacier; Tape Library; Large, cheap and slow hard disks; “Cold Storage”
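The offline rule on this slide can be written down as a small decision function. This is a minimal sketch, not the process actually used at Bayer: the function name, its parameters, and the reading of "one year" as 365 days are assumptions made for illustration.

```python
from datetime import date, timedelta

# "One year after initial analysis", interpreted here as 365 days (assumption).
ONLINE_PERIOD = timedelta(days=365)


def can_move_offline(initial_analysis: date,
                     today: date,
                     reanalysis_likely_within_6_months: bool) -> bool:
    """Return True if level-1 sequence data may be moved to offline storage.

    Hypothetical helper that encodes the two rules from the slide:
    data stays online for one year after the initial analysis, and it
    also stays online if re-analysis or a meta-study is likely soon.
    """
    if reanalysis_likely_within_6_months:
        return False
    return today - initial_analysis >= ONLINE_PERIOD


if __name__ == "__main__":
    # Analyzed in January 2012, no re-analysis planned: eligible for offline storage.
    print(can_move_offline(date(2012, 1, 1), date(2013, 3, 21), False))
    # Same data set, but a meta-study is expected: keep it online.
    print(can_move_offline(date(2012, 1, 1), date(2013, 3, 21), True))
```

Whether a re-analysis is "likely" remains a human judgment. The sketch only makes explicit that this judgment overrides the age of the data.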
  • 15. Data Security & Data Privacy
      ► Data Security is essential for protection of IP and for compliance with data privacy acts
          • Systems, especially if in the cloud, must have intrusion protection
          • Strict access control to data
          • Data encryption on external media and in archives
          • Ideally, data on permanent storage systems (e.g., hard disks) is always encrypted and is decrypted on-the-fly by analysis software
          • Usage of anonymized or pseudo-anonymized data whenever possible
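One common way to pseudonymize sample or patient identifiers is a keyed hash, so that the same identifier always maps to the same pseudonym but cannot be reversed without the key. The sketch below uses HMAC-SHA256 from the Python standard library. It illustrates the general idea only and is not a method described in the talk. The function name, the example identifiers, and the 16-character truncation are assumptions. It covers only identifiers, not the sequence data itself.

```python
import hashlib
import hmac


def pseudonymize(sample_id: str, secret_key: bytes, length: int = 16) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    Hypothetical helper: the key must be stored separately from the data,
    because anyone holding it can recompute pseudonyms for known identifiers.
    """
    digest = hmac.new(secret_key, sample_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


if __name__ == "__main__":
    key = b"replace-with-a-randomly-generated-secret"  # placeholder key (assumption)
    for sample_id in ["PATIENT-0001", "PATIENT-0002"]:  # made-up identifiers
        print(sample_id, "->", pseudonymize(sample_id, key))
```

A keyed hash is used instead of a plain hash because identifiers often follow a predictable pattern and could otherwise be recovered by hashing all candidate values.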
  • 16. Data Security & Data Privacy
      ► Data Privacy and Legal Aspects
          • RNA-seq data is like a fingerprint. Use of external service providers must be compliant with the “Bundesdatenschutzgesetz” (German Federal Data Protection Act) and EU regulations. The service provider:
              • Must prove data security by certification according to ISO 27001, ISO 27003, or ISO 27005.
              • Must sign the Standard Contractual Clauses of EU Commission Decision 2010/87/EU.
              • Must specify the data centers where data will be stored and analyzed.
          • Informed consent has to be adapted to cover the generation of RNA-seq data and storage of data in the cloud or at service providers (possibly outside Germany / Europe)
  • 17. Strategy for Data Handling & Analysis
      ► Short-Term/Mid-Term: store and analyze data on local infrastructure
          • Standard process for transferring data to local infrastructure and for archiving data disks from NGS providers is being established
          • Currently setting up in-house HPC platform and investigating options for offline storage
          • Use dual analysis system strategy (proprietary and commercial analysis pipelines)
  • 18. Outlook
      ► Long-term: store and analyze data in the cloud?
          • Clarify conditions for using cloud computing for NGS data analysis
              • Clarify cloud-readiness of commercial tools
              • Wait for company standards for cloud usage for confidential data
              • Run pilot analyses in the cloud
              • Select and test providers of PaaS NGS data analysis infrastructures
          • Clarify future use of Cloud Computing
              • Will most likely be used for peak needs for computational power and for analyzing large public data sets
              • Complete outsourcing of HPC infrastructure into the cloud is not likely in the near future but may be a long-term option
  • 19. Thank you!