Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central


Clinical genomics informatics talk -- Lisbon, Dec. 5, 2013

Published in: Technology, Health & Medicine
  • Implement a cloud-based, secure, scalable computing infrastructure capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnosis for health care professionals. Azure: 10 L instances, 24h a day / 30 TB/year / 10 GB of SQL Azure space / 30-100 TB
  • We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service

    1. Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
       Paolo Missier, School of Computing, Newcastle University, UK
       Genome Informatics, Lisbon, Dec. 5, 2013
       With thanks to Simon Woodman, Jacek Cała
       Genome Informatics 2013 – P.Missier
    2. Cloud e-genome - Motivation
       • 2-year pilot project
       • Funded by UK’s National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC)
       • Nov. 2013: Cloud resources from Azure for Research Award
         • 1 year’s worth of data/network/computing resources
       Aim:
       • To translate genetic testing by whole-exome sequencing into clinical practice
       Objectives:
       • Cost, Scalability: Demonstrate the cost-effectiveness of whole-exome data processing pipelines at population scale
       • Usability: Demonstrate a user-facing tool for variant interpretation and genetic diagnosis by clinicians
    3. Approach and testbed
       • Technical approach:
         • Move bulk processing from a dedicated HPC cluster to a cloud infrastructure (IaaS)
         • Port current NGS pipelines (scripts) to cloud-based workflow technology
         • Implement user tools for clinical diagnosis as cloud apps (SaaS)
       • Testbed and scale:
         • Neurological patients from the North-East of England, focus on rare diseases
         • Initial testing on about 300 sequences
         • 2500-3000 sequences expected within 12 months
    4. Key technical requirements
       • Scalability
         • In the rate and number of patient sequence submissions
         • In the density of sequence data (from whole exome to whole genome)
       • Flexibility, Traceability, Comparability across versions
         • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters)
         • Trace each version and its executions
         • Ability to compare results obtained using different pipelines and reason about the differences
       • Openness
         • Simplify the process of adding:
           • New variant analysis tools
           • New statistical methods for variant filtering, selection, and ranking
           • Integration with third party databases
    5. Outline
       • Current pipelines
       • The e-Science Central workflow management system
         • Home-grown system, cloud-based
         • SaaS: “Science as a Service”
         • Provenance-aware
       • Porting the pipelines to e-Science Central
         • Expected benefits
         • Strategy and issues
       • Role of Provenance
         • “Where do these variants come from?”
         • “Why do these results differ?”
    6. Current pipelines – top level view
       • Shell scripts control a suite of tools
       • Deployed on a local HPC cluster
       • Loadable modules and overall system maintained by dedicated staff
       • Cluster: 20 compute nodes, 48/96 GB RAM, 250 GB disk; 19 TB usable storage space; Gigabit Ethernet
    7. Pipeline – breakdown view
       1. Alignment - BWA
       2. Cleaning - Picard
       3. Sequence recalibration - GATK
       4. Coverage - bedTools
       5. Variant calling - GATK
       6. Variant recalibration - GATK
       7. Annotation - In house, WAnnovar
       • Filtering - GATK
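The seven numbered stages on this slide can be written down as an ordered list of (stage, tool) pairs, which is roughly what a shell script or a workflow definition encodes. The following is a minimal sketch only: `Stage`, `run_pipeline` and `dry_run` are hypothetical names, no real tool is invoked, and the tool names are copied from the slide as written.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stage:
    name: str
    tool: str


# Stage order and tool names as listed on the slide.
PIPELINE: List[Stage] = [
    Stage("Alignment", "BWA"),
    Stage("Cleaning", "Picard"),
    Stage("Sequence recalibration", "GATK"),
    Stage("Coverage", "bedTools"),
    Stage("Variant calling", "GATK"),
    Stage("Variant recalibration", "GATK"),
    Stage("Annotation", "In house, WAnnovar"),
]


def dry_run(sample_id: str, stage: Stage) -> None:
    """Hypothetical runner: a placeholder that does not call any tool."""
    return None


def run_pipeline(sample_id: str,
                 stages: List[Stage],
                 runner: Callable[[str, Stage], None]) -> List[str]:
    """Run every stage in order and return a human-readable execution log."""
    log = []
    for step, stage in enumerate(stages, start=1):
        runner(sample_id, stage)
        log.append(f"{step}. {stage.name} [{stage.tool}] on {sample_id}")
    return log


if __name__ == "__main__":
    for line in run_pipeline("sample-001", PIPELINE, dry_run):
        print(line)
```

Keeping the stage list as data, separate from the runner, is one way to make "replace a tool" or "reorder a stage" a one-line change, which is the kind of variation the later slides discuss.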
    8. Why move to the cloud?
       Standard benefits of virtualization, including:
       • Economy: No capital expenses. Affordable operations expenses
       • Elasticity: respond to bursts in submissions without upfront costs
       • Scale out vs scale up
         • New workflows can be deployed on new nodes on demand using VM
         • Workflow engine may exploit parallelism within the pipelines
       • Virtual storage with no real size limitations
       Possible issues:
       • Security, privacy
         • Addressed at policy level
    9. Why port to workflow?
       • Programming:
         • Workflows provide better abstraction in the specification of pipelines
         • Workflows directly executable by enactment engine
         • Easier to understand, share, and maintain over time
         • Flexible: relatively easy to introduce variations
       • System: minimal installation/deployment requirements
         • Fewer dedicated technical staff hours required
         • Automated dependency management, packaging, deployment
         • Extensible by wrapping new tools
         • Exploits data parallelism when possible
       • Execution monitoring, provenance collection
         • Persistent trace serves as evidence for data
         • Amenable to automated analysis
    10. Porting – in progress
    11. Porting - alignment
    12. Porting - alignment
    13. Porting – Sequence recalibration
    14. e-Science Central middleware
        • Middleware on top of commodity cloud storage and compute resources
        • [Diagram: App 1 ... App n running on a Cloud Platform (store, manage and process data) layered over Cloud Infrastructure (storage & compute)]
        • Portability across platforms and configurations
          • Single computer
          • Cluster of local servers
          • Public cloud providers (Microsoft Azure, but also AWS)
        • A platform for our academic research
          • Scalability, data management, cloud computing, medical data management
    15. e-Science Central: scalable, extensible
        • Client APIs / integrating other components
        • New workflow blocks easily programmed (in Java)
        • Integrate multiple runtime environments
          • R, Octave, Java, Javascript, (Perl)
        • Workflow tasks mapped to multiple workers
        • [Architecture diagram labels: web browser; rich client app; Web UI; REST API; e-Science Central main server (<<Azure VM>>); JMS queue (workflow invocations); three Workflow engine instances (<<worker role>>); e-SC db backend (e-SC control data); e-SC blob store (workflow data) on the Azure Blob store (<<Azure VM>>)]
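The idea of workflow invocations being queued and picked up by several workflow engines can be illustrated with a small in-process sketch. This is not the e-Science Central API: Python's `queue.Queue` and threads stand in for the JMS queue and the worker roles, and `dispatch`, `worker` and the `engine-N` names are hypothetical.

```python
import queue
import threading
from typing import List, Tuple


def worker(name: str,
           invocations: "queue.Queue",
           results: List[Tuple[str, str]],
           lock: threading.Lock) -> None:
    """Take invocations off the shared queue until a None sentinel arrives."""
    while True:
        item = invocations.get()
        if item is None:
            invocations.task_done()
            break
        # A real engine would execute the workflow here; the sketch only
        # records which engine picked up which invocation.
        with lock:
            results.append((name, item))
        invocations.task_done()


def dispatch(invocation_ids: List[str], n_workers: int = 3) -> List[Tuple[str, str]]:
    """Distribute invocations over n_workers engines via one shared queue."""
    invocations: "queue.Queue" = queue.Queue()
    results: List[Tuple[str, str]] = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker,
                         args=(f"engine-{i}", invocations, results, lock))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for inv in invocation_ids:
        invocations.put(inv)
    for _ in threads:
        invocations.put(None)  # one stop sentinel per engine
    invocations.join()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for engine, inv in dispatch(["wf-1", "wf-2", "wf-3", "wf-4"]):
        print(engine, "ran", inv)
```

Which engine handles which invocation is not deterministic; only the set of completed invocations is. Adding engines changes `n_workers` and nothing else, which is the scale-out property the slide points at.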
    16. e-Science Central is provenance-aware
        • Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
        • Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
        Why does provenance matter?
        • To establish quality, relevance, trust
        • To track information attribution through complex transformations
        • To describe one’s experiment to others, for understanding / reuse
        • To provide evidence in support of scientific claims
        • To enable post hoc process analysis: debugging, improvement, evolution
    17. Role of provenance in Cloud e-genome
        • Ultimately, provenance is evidence in support of clinical diagnosis
          1. Why do these variants appear in the output list?
          2. Why have you concluded they are disease-causing?
        • Requires ability to trace variants through workflow execution
          • Simple scripting lacks this functionality
        • Includes tracing of user decision processes (still experimental)
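Tracing an output back through a workflow execution can be pictured as walking a provenance trace from a data item to the steps that produced it. The sketch below assumes a deliberately simple trace format (each data item maps to the process that generated it and the inputs that process used); the item names in `TRACE` are invented for illustration and are not taken from the project.

```python
from typing import Dict, List, Tuple

# Hypothetical trace: data item -> (generating process, input data items).
Trace = Dict[str, Tuple[str, List[str]]]

TRACE: Trace = {
    "variant_list": ("Annotation", ["recalibrated_variants"]),
    "recalibrated_variants": ("Variant recalibration", ["raw_variants"]),
    "raw_variants": ("Variant calling", ["aligned_reads"]),
    "aligned_reads": ("Alignment", ["input_reads"]),
}


def lineage(item: str, trace: Trace) -> List[Tuple[str, str]]:
    """Return (process, generated item) pairs reachable upstream of item.

    Items with no entry in the trace are treated as external inputs.
    """
    steps: List[Tuple[str, str]] = []
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        if current in seen or current not in trace:
            continue
        seen.add(current)
        process, inputs = trace[current]
        steps.append((process, current))
        stack.extend(inputs)
    return steps


if __name__ == "__main__":
    for process, generated in lineage("variant_list", TRACE):
        print(f"{generated} was generated by {process}")
```

A plain shell script leaves no such structure behind, which is the gap the slide identifies: the question "why is this variant in the output?" needs the recorded dependencies, not just the final file.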
    18. Comparing results across pipeline configurations
        • Run pipeline version V1 → Variant list VL1
        • V1 → V2: replace BWA version; modify Annovar configuration parameters
        • Run pipeline version V2 → Variant list VL2
        • Compare VL1 and VL2: DDIFF (data differencing), PDIFF (provenance differencing) ??
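The slide names DDIFF (data differencing) without defining it. One simple reading, assumed here purely for illustration, is a set difference between the two variant lists, with each variant identified by a (chromosome, position, ref, alt) tuple. The function name `ddiff` and the sample variants are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed variant identity: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def ddiff(vl1: Iterable[Variant], vl2: Iterable[Variant]) -> Dict[str, Set[Variant]]:
    """Split two variant lists into shared and list-specific variants."""
    s1, s2 = set(vl1), set(vl2)
    return {
        "only_in_vl1": s1 - s2,
        "only_in_vl2": s2 - s1,
        "shared": s1 & s2,
    }


if __name__ == "__main__":
    # Made-up variants, not real data.
    vl1 = [("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T")]
    vl2 = [("chr1", 1000, "A", "G"), ("chr3", 3000, "G", "A")]
    for group, variants in ddiff(vl1, vl2).items():
        print(group, sorted(variants))
```

A data difference of this kind says which variants changed between V1 and V2, but not which pipeline change caused it; that second question is what the provenance difference is meant to address.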
    19. PDIFF - overview
        • [Figure: two workflows, WA and WB]
    20. The corresponding provenance traces
        • [Figure: (i) Trace A and (ii) Trace B, two provenance graphs over data items (d0, d1, d2, w, k, h, x, y, z and their primed counterparts) and service invocations (S, Sv2, S0, S0', S1, S2, S2v2, S3, S3', S4, S5) with ports P0 and P1]
    21. Delta graph computed by PDIFF
        • PDIFF helps determine the impact of variations in the pipeline
        • [Figure: delta graph pairing nodes of the two traces, including (S, Sv2) and (S2, S2v2) labelled "version change", a pair labelled "service repl.", and the P0 and P1 branches of S2 and S4]
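The slides do not give the PDIFF algorithm, so the following is not it. It is a simplified stand-in that conveys the idea of a delta graph: walk two provenance traces backwards in lockstep from their outputs and record the pairs of nodes where the generating service differs. The trace format, the function name `trace_delta`, and the tiny example traces (loosely echoing the S/Sv2 and S2/S2v2 labels in the figure) are all assumptions.

```python
from typing import Dict, List, Tuple

# Assumed trace format: data item -> (generating service, input data items).
Trace = Dict[str, Tuple[str, List[str]]]


def trace_delta(out_a: str, out_b: str,
                trace_a: Trace, trace_b: Trace) -> List[Tuple[str, str, str]]:
    """Pair up nodes of two traces from the outputs backwards.

    Returns (item in A, item in B, description) for every divergence found.
    Inputs are paired by position, which is a strong simplifying assumption.
    """
    delta: List[Tuple[str, str, str]] = []
    seen = set()
    stack = [(out_a, out_b)]
    while stack:
        a, b = stack.pop()
        if (a, b) in seen:
            continue
        seen.add((a, b))
        node_a, node_b = trace_a.get(a), trace_b.get(b)
        if node_a is None or node_b is None:
            # Both missing: external inputs. Only one missing: shapes differ.
            if (node_a is None) != (node_b is None):
                delta.append((a, b, "structure differs"))
            continue
        service_a, inputs_a = node_a
        service_b, inputs_b = node_b
        if service_a != service_b:
            delta.append((a, b, f"{service_a} -> {service_b}"))
        if len(inputs_a) != len(inputs_b):
            delta.append((a, b, "input count differs"))
            continue
        stack.extend(zip(inputs_a, inputs_b))
    return delta


if __name__ == "__main__":
    # Hypothetical traces; primed names mark items from the second run.
    trace_a: Trace = {"x": ("S4", ["y"]), "y": ("S2", ["d1"]), "d1": ("S", ["d0"])}
    trace_b: Trace = {"x'": ("S4", ["y'"]), "y'": ("S2v2", ["d1'"]), "d1'": ("Sv2", ["d0"])}
    for a, b, what in trace_delta("x", "x'", trace_a, trace_b):
        print(a, b, what)
```

In this toy case the delta contains only the two steps whose service version changed, which mirrors the slide's point: the provenance difference localizes which pipeline variations sit upstream of a changed result.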
    22. Summary
        • Cloud e-genome has just begun (Sept. 2013)
          • Variant calling and interpretation for clinical use
          • Whole-exome sequence processing on a cloud infrastructure
          • Windows Azure: project sponsor
        • Currently porting existing pipelines to e-Science Central workflow
        • Provenance as a key element of supporting evidence
        • Scalability testing to start in Sept 2014