Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Clinical genomics informatics talk -- Lisbon, Dec. 5, 2013

  • Implement a secure, scalable, cloud-based computing infrastructure capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnosis for health care professionals. Azure: 10 L instances, 24h a day / 30 TB/year / 10 GB of SQL Azure space / 30-100 TB
  • We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service

Transcript

  • 1. Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
      Genome Informatics 2013 – P.Missier
      Paolo Missier, School of Computing, Newcastle University, UK
      Genome Informatics, Lisbon, Dec. 5, 2013
      With thanks to Simon Woodman, Jacek Cała
  • 2. Cloud e-genome - Motivation
      • 2-year pilot project
      • Funded by UK’s National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC)
      • Nov. 2013: Cloud resources from Azure for Research Award
          • 1 year’s worth of data/network/computing resources
      Aim:
      • To translate genetic testing by whole-exome sequencing into clinical practice
      Objectives:
      • Cost, Scalability: Demonstrate the cost-effectiveness of whole-exome data processing pipelines at population scale
      • Usability: Demonstrate a user-facing tool for variant interpretation and genetic diagnosis by clinicians
  • 3. Approach and testbed
      • Technical Approach:
          • Move bulk processing from a dedicated HPC cluster to a cloud infrastructure (IaaS)
          • Port current NGS pipelines (scripts) to cloud-based workflow technology
          • Implement user tools for clinical diagnosis as cloud apps (SaaS)
      • Testbed and scale:
          • Neurological patients from the North-East of England, focus on rare diseases
          • Initial testing on about 300 sequences
          • 2500-3000 sequences expected within 12 months
  • 4. Key technical requirements
      • Scalability
          • In the rate and number of patient sequence submissions
          • In the density of sequence data (from whole exome to whole genome)
      • Flexibility, Traceability, Comparability across versions
          • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters)
          • Trace each version and its executions
          • Ability to compare results obtained using different pipelines and reason about the differences
      • Openness
          • Simplify the process of adding:
              • New variant analysis tools
              • New statistical methods for variant filtering, selection, and ranking
              • Integration with third party databases
  • 5. Outline
      • Current pipelines
      • The e-Science Central workflow management system
          • Home-grown system, cloud-based
          • SaaS: “Science as a Service”
          • Provenance-aware
      • Porting the pipelines to e-Science Central
          • Expected benefits
          • Strategy and issues
      • Role of Provenance
          • “Where do these variants come from?”
          • “Why do these results differ?”
  • 6. Current pipelines – top level view
      • Shell scripts control a suite of tools
      • Deployed on a local HPC cluster
      • Loadable modules and overall system maintained by dedicated staff
      HPC cluster: 20 compute nodes; 48/96GB RAM / 250GB disk; 19TB usable storage space; Gigabit Ethernet
  • 7. Pipeline – breakdown view
      Filtering - GATK
      1. Alignment - BWA
      2. Cleaning - Picard
      3. Sequence recalibration - GATK
      4. Coverage - bedTools
      5. Variant calling - GATK
      6. Variant recalibration - GATK
      7. Annotation - In house - WAnnovar
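The seven numbered stages on this slide can be written down as data rather than as a shell script. The sketch below is only an illustration: it assumes a strictly linear chain, which the slide does not confirm, and it leaves out the unnumbered "Filtering - GATK" label because its position in the flow is not clear from the transcript. `Stage`, `run_pipeline`, and `fake_execute` are hypothetical names, and no real tool is invoked.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stage:
    """One pipeline step: a label from the slide plus the tool that runs it."""
    name: str
    tool: str


# Stage names and tools as numbered on the slide.
PIPELINE: List[Stage] = [
    Stage("Alignment", "BWA"),
    Stage("Cleaning", "Picard"),
    Stage("Sequence recalibration", "GATK"),
    Stage("Coverage", "bedTools"),
    Stage("Variant calling", "GATK"),
    Stage("Variant recalibration", "GATK"),
    Stage("Annotation", "In house / WAnnovar"),
]


def run_pipeline(sample_id: str, execute: Callable[[Stage, str], str]) -> List[str]:
    """Run every stage in order, feeding each stage's output to the next.

    `execute` is a hypothetical callback that would invoke the real tool.
    It receives the stage and the current input name and returns the
    name of the output it produced.
    """
    outputs = []
    current = sample_id
    for stage in PIPELINE:
        current = execute(stage, current)
        outputs.append(current)
    return outputs


def fake_execute(stage: Stage, input_name: str) -> str:
    """Stand-in for a real tool call: it only derives an output name."""
    return f"{input_name}.{stage.tool.split()[0].lower()}"
```

Calling `run_pipeline("sample01", fake_execute)` returns one output name per stage. Swapping a tool or reordering stages then means editing the `PIPELINE` list only.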
  • 8. Why move to the cloud?
      Standard benefits of virtualization, including:
      • Economy: No capital expenses. Affordable operational expenses
      • Elasticity: respond to bursts in submissions without upfront costs
      • Scale out vs scale up
          • New workflows can be deployed on new nodes on demand using VMs
          • Workflow engine may exploit parallelism within the pipelines
      • Virtual storage with no real size limitations
      • Possible issues
          • Security, privacy
          • Addressed at policy level
  • 9. Why port to workflow?
      • Programming:
          • Workflows provide better abstraction in the specification of pipelines
          • Workflows directly executable by enactment engine
          • Easier to understand, share, and maintain over time
          • Flexible – relatively easy to introduce variations
      • System: minimal installation/deployment requirements
          • Fewer dedicated technical staff hours required
          • Automated dependency management, packaging, deployment
          • Extensible by wrapping new tools
          • Exploits data parallelism when possible
      • Execution monitoring, provenance collection
          • Persistent trace serves as evidence for data
          • Amenable to automated analysis
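"Exploits data parallelism when possible" can be illustrated with a per-sample step mapped over independent samples. This is a minimal sketch using Python's standard thread pool, not the e-Science Central engine. `align_sample` and `align_all` are hypothetical names, and the step body is a placeholder rather than a real alignment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def align_sample(sample_id: str) -> str:
    """Hypothetical stand-in for a per-sample alignment step."""
    return f"{sample_id}.aligned"


def align_all(sample_ids: List[str], max_workers: int = 4) -> Dict[str, str]:
    """Apply the same step to every sample concurrently.

    Samples do not depend on each other, so the step can be mapped
    over them in parallel. `pool.map` returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(align_sample, sample_ids))
    return dict(zip(sample_ids, results))
```

A workflow engine can apply the same idea at the block level, since it knows from the workflow graph which inputs are independent.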
  • 10. Porting – in progress
  • 11. Porting - alignment
  • 12. Porting - alignment
  • 13. Porting – Sequence recalibration
  • 14. e-Science Central middleware
      • Middleware on top of commodity cloud storage and compute resources
      • Portability across platforms and configurations
          • Single computer
          • Cluster of local servers
          • Public cloud providers (Microsoft Azure, but also AWS)
      • A platform for our academic research
          • Scalability, data management, cloud computing, medical data management
      [Diagram labels: App 1 ... App n; Cloud Platform; Store, manage and process data; Cloud Infrastructure: Storage & Compute]
  • 15. e-Science Central: scalable, extensible
      • Client APIs / integrating other components
      • New workflow blocks easily programmed (in Java)
      • Integrate multiple runtime environments
          • R, Octave, Java, Javascript, (Perl)
      • Workflow tasks mapped to multiple workers
      [Architecture diagram labels: web browser; rich client app; Web UI; REST API; e-Science Central main server; e-SC db backend; e-SC control data; JMS queue; workflow invocations; Workflow engine <<worker role>> (three instances); workflow data; e-SC blob store; Azure Blob store; <<Azure VM>> (two instances)]
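The slide's diagram shows workflow invocations passing through a JMS queue to several workflow engines. The sketch below imitates that dispatch pattern with Python's in-process `queue.Queue` and threads standing in for the JMS queue and the worker roles. It is an analogy under those assumptions, not e-Science Central code, and `run_workers` and the `engine-N` names are hypothetical.

```python
import queue
import threading
from typing import Dict, List


def run_workers(invocations: List[str], n_workers: int = 3) -> Dict[str, str]:
    """Dispatch invocations to worker threads through one shared queue.

    Returns a mapping from invocation name to the worker that handled it.
    Invocation names are assumed to be unique.
    """
    pending = queue.Queue()
    handled_by: Dict[str, str] = {}
    lock = threading.Lock()

    def worker(worker_name: str) -> None:
        while True:
            try:
                invocation = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, worker stops
            with lock:
                handled_by[invocation] = worker_name

    # All invocations are enqueued before any worker starts,
    # so an empty queue reliably means there is no work left.
    for invocation in invocations:
        pending.put(invocation)

    threads = [
        threading.Thread(target=worker, args=(f"engine-{i}",))
        for i in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled_by
```

Which engine picks up which invocation is not deterministic. The property that matters is that every invocation is handled exactly once, and that adding engines requires no change on the submitting side.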
  • 16. E-Science Central is Provenance-aware
      • Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
      • Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
      Why does provenance matter?
      • To establish quality, relevance, trust
      • To track information attribution through complex transformations
      • To describe one’s experiment to others, for understanding / reuse
      • To provide evidence in support of scientific claims
      • To enable post hoc process analysis - debugging, improvement, evolution
  • 17. Role of provenance in Cloud e-genome
      • Ultimately, provenance is evidence in support of clinical diagnosis
          1. Why do these variants appear in the output list?
          2. Why have you concluded they are disease-causing?
      • Requires ability to trace variants through workflow execution
          • Simple scripting lacks this functionality
      • Includes tracing of user decision processes
      (Still experimental)
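Question 1 on this slide ("Why do these variants appear in the output list?") amounts to walking a provenance trace backwards from an output. The sketch below assumes a trace stored as a mapping from each data item to the activity that produced it and the inputs it used. The file names, the variant identifier, and the `lineage` function are all hypothetical and do not reflect e-Science Central's actual trace format.

```python
from typing import Dict, List, Tuple

# Hypothetical trace: data item -> (activity that produced it, inputs used).
TRACE: Dict[str, Tuple[str, List[str]]] = {
    "variant:chr1:12345": ("Annotation", ["calls.vcf"]),
    "calls.vcf": ("Variant calling", ["recal.bam"]),
    "recal.bam": ("Sequence recalibration", ["clean.bam"]),
    "clean.bam": ("Cleaning", ["aligned.bam"]),
    "aligned.bam": ("Alignment", ["sample01.fastq"]),
}


def lineage(item: str, trace: Dict[str, Tuple[str, List[str]]]) -> List[Tuple[str, str]]:
    """Walk derivations backwards from `item`.

    Returns (data item, producing activity) pairs, nearest step first.
    Items with no entry in the trace are treated as raw inputs.
    """
    steps: List[Tuple[str, str]] = []
    seen = set()
    frontier = [item]
    while frontier:
        current = frontier.pop(0)
        if current in seen or current not in trace:
            continue  # already visited, or a raw input
        seen.add(current)
        activity, inputs = trace[current]
        steps.append((current, activity))
        frontier.extend(inputs)
    return steps
```

With plain shell scripts this record would have to be reconstructed from logs. A provenance-aware engine can collect it as a side effect of execution.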
  • 18. Comparing results across pipeline configurations
      • Run pipeline version V1: Variant list VL1
      • V1 → V2:
          • Replace BWA version
          • Modify Annovar configuration parameters
      • Run pipeline version V2: Variant list VL2
      • Variant list VL1, Variant list VL2: DDIFF (data differencing)
      • PDIFF (provenance differencing): ??
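Data differencing between the two variant lists can be pictured as a set comparison. The sketch below is only one plausible reading of DDIFF: it assumes a variant is identified by chromosome, position, reference allele, and alternate allele, which the slides do not state. `Variant` and `data_diff` are hypothetical names.

```python
from typing import Dict, NamedTuple, Set


class Variant(NamedTuple):
    """Assumed identity of a variant for comparison purposes."""
    chrom: str
    pos: int
    ref: str
    alt: str


def data_diff(vl1: Set[Variant], vl2: Set[Variant]) -> Dict[str, Set[Variant]]:
    """Split two variant lists into what is unique to each and what is shared."""
    return {
        "only_in_vl1": vl1 - vl2,
        "only_in_vl2": vl2 - vl1,
        "in_both": vl1 & vl2,
    }
```

A difference found this way says what changed between VL1 and VL2 but not why. Explaining the cause is the part the slide assigns to provenance differencing.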
  • 19. PDIFF - overview
      [Diagram labels: WA, WB]
  • 20. The corresponding provenance traces
      [Figure: provenance graphs for (i) Trace A and (ii) Trace B; node labels include d0, d1, d1', d2, S, Sv2, S0, S0', S1, S2, S2v2, S3, S3', S4, S5, P0, P1]
  • 21. Delta graph computed by PDIFF
      • PDIFF helps determine the impact of variations in the pipeline
      [Figure: delta graph pairing nodes of the two traces; annotations include "version change" (S, Sv2 and S2, S2v2), "service repl.", and the P0 and P1 branches of S2 and S4]
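The delta graph labels paired nodes with "version change" and "service repl.". The sketch below reproduces only that labelling idea on a much simpler structure and is not the PDIFF algorithm. It assumes each trace is a flat mapping from a step name to the service and version that ran, and that steps in the two traces are already aligned by name. `Invocation`, `trace_diff`, and the "added" and "removed" labels are hypothetical.

```python
from typing import Dict, NamedTuple


class Invocation(NamedTuple):
    """Which service ran at a step, and in which version."""
    service: str
    version: str


def trace_diff(
    trace_a: Dict[str, Invocation], trace_b: Dict[str, Invocation]
) -> Dict[str, str]:
    """Label each step whose invocation differs between the two traces."""
    changes: Dict[str, str] = {}
    for step in sorted(set(trace_a) | set(trace_b)):
        a, b = trace_a.get(step), trace_b.get(step)
        if a == b:
            continue  # identical invocation, nothing to report
        if a is None:
            changes[step] = "added"
        elif b is None:
            changes[step] = "removed"
        elif a.service != b.service:
            changes[step] = "service replacement"
        else:
            changes[step] = "version change"
    return changes
```

Real traces are graphs with branches, so aligning the nodes of the two traces is the hard part of the problem. This sketch sidesteps it by assuming matching step names.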
  • 22. Summary
      • Cloud e-genome has just begun (Sept. 2013)
          • Variant calling and interpretation for clinical use
          • Whole-exome sequence processing on a cloud infrastructure
          • Windows Azure – project sponsor
      • Currently porting existing pipelines to e-Science Central workflow
      • Provenance as a key element of supporting evidence
      • Scalability testing to start in Sept 2014