Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Clinical genomics informatics talk -- Lisbon, Dec. 5, 2013


Published in: Technology, Health & Medicine

Notes
  • Implement a cloud-based, secure, scalable computing infrastructure capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnosis for health care professionals. Azure: 10 L instances, 24 h a day / 30 TB/year / 10 GB of SQL Azure space / 30-100 TB
  • We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS: "Science as a Service"
  • Transcript

    • 1. Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
        • Paolo Missier, School of Computing, Newcastle University, UK
        • Genome Informatics 2013, Lisbon, Dec. 5, 2013
        • With thanks to Simon Woodman, Jacek Cała
    • 2. Cloud e-genome - Motivation
        • 2-year pilot project
        • Funded by UK's National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC)
        • Nov. 2013: Cloud resources from Azure for Research Award
            • 1 year's worth of data/network/computing resources
        • Aim: To translate genetic testing by whole-exome sequencing into clinical practice
        • Objectives:
            • Cost, Scalability: Demonstrate the cost-effectiveness of whole-exome data processing pipelines at population scale
            • Usability: Demonstrate a user-facing tool for variant interpretation and genetic diagnosis by clinicians
    • 3. Approach and testbed
        • Technical Approach:
            • Move bulk processing from a dedicated HPC cluster to a cloud infrastructure (IaaS)
            • Port current NGS pipelines (scripts) to cloud-based workflow technology
            • Implement user tools for clinical diagnosis as cloud apps (SaaS)
        • Testbed and scale:
            • Neurological patients from the North-East of England, focus on rare diseases
            • Initial testing on about 300 sequences
            • 2500-3000 sequences expected within 12 months
    • 4. Key technical requirements
        • Scalability
            • In the rate and number of patient sequence submissions
            • In the density of sequence data (from whole exome to whole genome)
        • Flexibility, Traceability, Comparability across versions
            • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters)
            • Trace each version and its executions
            • Ability to compare results obtained using different pipelines and reason about the differences
        • Openness
            • Simplify the process of adding:
                • New variant analysis tools
                • New statistical methods for variant filtering, selection, and ranking
                • Integration with third party databases
    • 5. Outline
        • Current pipelines
        • The e-Science Central workflow management system
            • Home-grown system, cloud-based
            • SaaS: "Science as a Service"
            • Provenance-aware
        • Porting the pipelines to e-Science Central
            • Expected benefits
            • Strategy and issues
        • Role of Provenance
            • "Where do these variants come from?"
            • "Why do these results differ?"
    • 6. Current pipelines - top level view
        • Shell scripts control a suite of tools
        • Deployed on a local HPC cluster
        • Loadable modules and overall system maintained by dedicated staff
        • Cluster: 20 compute nodes, 48/96 GB RAM, 250 GB disk; 19 TB usable storage space; Gigabit Ethernet
    • 7. Pipeline - breakdown view
        1. Alignment - BWA
        2. Cleaning - Picard
        3. Sequence recalibration - GATK
        4. Coverage - bedTools
        5. Variant calling - GATK
        6. Variant recalibration - GATK
        7. Annotation - In house, WAnnovar
        • Filtering - GATK (unnumbered on the slide)
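
The stage order on slide 7 can be written down as data, which is roughly what porting a script to a workflow amounts to. The following is a minimal sketch only: the stage names and tools come from the slide, while `STAGES` and `run_pipeline` are hypothetical names, a strictly linear order is assumed, and no real tool is invoked.

```python
from typing import List, Tuple

# Stage names and tools as listed on slide 7; a strictly linear order is an
# assumption of this sketch (the real pipeline may run some steps in parallel).
STAGES: List[Tuple[str, str]] = [
    ("alignment", "BWA"),
    ("cleaning", "Picard"),
    ("sequence recalibration", "GATK"),
    ("coverage", "bedTools"),
    ("variant calling", "GATK"),
    ("variant recalibration", "GATK"),
    ("annotation", "in-house / WAnnovar"),
]


def run_pipeline(sample_id: str, stages: List[Tuple[str, str]] = STAGES) -> List[str]:
    """Return a log of the stages that would run for one sample.

    Placeholder only: each entry stands in for a call to the named tool.
    """
    log = []
    for number, (stage, tool) in enumerate(stages, start=1):
        log.append(f"{sample_id}: step {number} {stage} [{tool}]")
    return log


if __name__ == "__main__":
    for line in run_pipeline("sample-001"):
        print(line)
```

Keeping the stage list as data rather than as script text is what makes it cheap to swap a tool or reorder steps, which is the flexibility requirement stated on slide 4.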
    • 8. Why move to the cloud?
        • Standard benefits of virtualization, including:
            • Economy: No capital expenses. Affordable operations expenses
            • Elasticity: respond to bursts in submissions without upfront costs
        • Scale out vs scale up
            • New workflows can be deployed on new nodes on demand using VM
            • Workflow engine may exploit parallelism within the pipelines
            • Virtual storage with no real size limitations
        • Possible issues
            • Security, privacy
            • Addressed at policy level
    • 9. Why port to workflow?
        • Programming:
            • Workflows provide better abstraction in the specification of pipelines
            • Workflows directly executable by enactment engine
            • Easier to understand, share, and maintain over time
            • Flexible: relatively easy to introduce variations
        • System: minimal installation/deployment requirements
            • Fewer dedicated technical staff hours required
            • Automated dependency management, packaging, deployment
            • Extensible by wrapping new tools
            • Exploits data parallelism when possible
        • Execution monitoring, provenance collection
            • Persistent trace serves as evidence for data
            • Amenable to automated analysis
    • 10. Porting - in progress
    • 11. Porting - alignment
    • 12. Porting - alignment
    • 13. Porting - Sequence recalibration
    • 14. e-Science Central middleware
        • Middleware on top of commodity cloud storage and compute resources
            • Store, manage and process data
        • [Figure: apps (App 1 ... App n) running on a cloud platform layered over cloud infrastructure (storage and compute)]
        • Portability across platforms and configurations
            • Single computer
            • Cluster of local servers
            • Public cloud providers (Microsoft Azure, but also AWS)
        • A platform for our academic research
            • Scalability, data management, cloud computing, medical data management
    • 15. e-Science Central: scalable, extensible
        • Client APIs / integrating other components
        • New workflow blocks easily programmed (in Java)
        • Integrate multiple runtime environments
            • R, Octave, Java, Javascript, (Perl)
        • Workflow tasks mapped to multiple workers
        • [Figure: architecture diagram. Components: web browser, rich client app, Web UI, REST API, e-Science Central main server (<<Azure VM>>), e-SC db backend, e-SC blob store (Azure Blob store), JMS queue carrying workflow invocations, three workflow engines (<<worker role>>); data flows labelled "e-SC control data" and "workflow data"]
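
Slide 15 describes workflow invocations passing through a queue to several workflow engines. The sketch below illustrates that general pattern with Python's standard `queue` and `threading` modules. It is an analogy under stated assumptions, not e-Science Central code: the real system uses a JMS queue and Azure worker roles, and `run_workers` and the invocation names are hypothetical.

```python
import queue
import threading
from typing import Dict, Iterable


def run_workers(invocations: Iterable[str], n_workers: int = 3) -> Dict[str, int]:
    """Distribute invocations to n_workers threads through one shared queue.

    Returns a mapping from invocation id to the id of the worker that took it.
    """
    work_queue: "queue.Queue" = queue.Queue()
    handled_by: Dict[str, int] = {}
    lock = threading.Lock()

    def worker(worker_id: int) -> None:
        while True:
            item = work_queue.get()
            if item is None:          # sentinel: no more work for this worker
                work_queue.task_done()
                break
            with lock:                # a real engine would run the workflow here
                handled_by[item] = worker_id
            work_queue.task_done()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for thread in threads:
        thread.start()
    for invocation in invocations:
        work_queue.put(invocation)
    for _ in threads:                 # one sentinel per worker
        work_queue.put(None)
    work_queue.join()
    for thread in threads:
        thread.join()
    return handled_by


if __name__ == "__main__":
    print(run_workers(["wf-1", "wf-2", "wf-3", "wf-4"]))
```

Which worker picks up which invocation is not deterministic. Scaling out then means starting more consumers rather than changing the producer.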
    • 16. e-Science Central is provenance-aware
        • Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
        • Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
        • Why does provenance matter?
            • To establish quality, relevance, trust
            • To track information attribution through complex transformations
            • To describe one's experiment to others, for understanding / reuse
            • To provide evidence in support of scientific claims
            • To enable post hoc process analysis: debugging, improvement, evolution
    • 17. Role of provenance in Cloud e-genome
        • Ultimately, provenance is evidence in support of clinical diagnosis
            1. Why do these variants appear in the output list?
            2. Why have you concluded they are disease-causing?
        • Requires ability to trace variants through workflow execution
            • Simple scripting lacks this functionality
        • Includes tracing of user decision processes (still experimental)
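
To make "trace variants through workflow execution" from slide 17 concrete, here is a minimal sketch of a provenance log that records which step produced each item and from which inputs, and then answers "where does this variant come from?". The class `ProvenanceLog`, its methods, and the file and variant names are all hypothetical; this is not the e-Science Central provenance model.

```python
from typing import Dict, List, Tuple


class ProvenanceLog:
    """Records, for each data item, the step that produced it and its inputs."""

    def __init__(self) -> None:
        self._derivations: Dict[str, Tuple[str, List[str]]] = {}

    def record(self, output: str, step: str, inputs: List[str]) -> None:
        """Note that `step` produced `output` from `inputs`."""
        self._derivations[output] = (step, list(inputs))

    def lineage(self, item: str) -> List[str]:
        """Return the steps that contributed to `item`, most recent first.

        Items with no recorded producer (raw inputs) end the walk. A step that
        is reachable along several paths is listed once per path.
        """
        steps: List[str] = []
        frontier = [item]
        while frontier:
            current = frontier.pop()
            if current in self._derivations:
                step, inputs = self._derivations[current]
                steps.append(step)
                frontier.extend(inputs)
        return steps


if __name__ == "__main__":
    log = ProvenanceLog()
    log.record("sample.bam", "alignment", ["sample.fastq"])
    log.record("sample.vcf", "variant calling", ["sample.bam"])
    log.record("variant chr1:12345", "annotation", ["sample.vcf"])
    print(log.lineage("variant chr1:12345"))
```

A shell script produces the same files but leaves no such record, which is the gap the slide points at.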
    • 18. Comparing results across pipeline configurations
        • Run pipeline version V1 → variant list VL1
        • V1 → V2: replace BWA version, modify Annovar configuration parameters
        • Run pipeline version V2 → variant list VL2
        • Compare VL1 and VL2: DDIFF (data differencing), PDIFF (provenance differencing) ??
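
The data-differencing step on slide 18 can be illustrated, in its simplest form, as a set comparison of the two variant lists. This is a sketch under assumptions: the slide does not say how DDIFF is implemented, and the `Variant` tuple layout and the function name `ddiff` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed representation of a variant: (chromosome, position, ref, alt).
Variant = Tuple[str, int, str, str]


def ddiff(vl1: Iterable[Variant], vl2: Iterable[Variant]) -> Dict[str, Set[Variant]]:
    """Split two variant lists into V1-only, V2-only and shared variants."""
    set1, set2 = set(vl1), set(vl2)
    return {
        "only_in_v1": set1 - set2,
        "only_in_v2": set2 - set1,
        "shared": set1 & set2,
    }


if __name__ == "__main__":
    vl1 = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")]
    vl2 = [("chr1", 100, "A", "G"), ("chr3", 300, "G", "A")]
    for label, variants in ddiff(vl1, vl2).items():
        print(label, sorted(variants))
```

A data diff of this kind says which variants changed but not why, which is the question provenance differencing is meant to address.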
    • 19. PDIFF - overview
        • [Figure: two workflows, WA and WB]
    • 20. The corresponding provenance traces
        • [Figure: two provenance trace graphs, (i) Trace A and (ii) Trace B. Node labels include services S and S0-S5 (with variants Sv2, S2v2, S0', S3'), ports P0 and P1, and data items d0-d2, w, x, y, z, h, k (primed in Trace B)]
    • 21. Delta graph computed by PDIFF
        • PDIFF helps determine the impact of variations in the pipeline
        • [Figure: delta graph pairing nodes of the two traces, e.g. (S, Sv2) version change, (S2, S2v2) version change, a service replacement, and the P0/P1 branches of S2 and S4]
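
The idea behind the delta graph on slide 21 can be approximated by walking two provenance traces backwards from their outputs in lockstep and recording where the producing steps differ. The sketch below is a deliberate simplification and not the PDIFF algorithm itself: it assumes both traces have the same shape, the `Trace` representation and the function `lockstep_diff` are hypothetical, and the step names are borrowed from the slide purely for illustration.

```python
from typing import Dict, List, Tuple

# A trace maps each data item to (producing step, list of input data items).
Trace = Dict[str, Tuple[str, List[str]]]


def lockstep_diff(a: Trace, b: Trace, out_a: str, out_b: str) -> List[Tuple[str, str]]:
    """Walk both traces backwards from their outputs; collect step mismatches."""
    deltas: List[Tuple[str, str]] = []
    stack = [(out_a, out_b)]
    seen = set()
    while stack:
        item_a, item_b = stack.pop()
        if (item_a, item_b) in seen:
            continue
        seen.add((item_a, item_b))
        if item_a not in a or item_b not in b:
            continue  # no recorded producer in at least one trace: stop here
        step_a, inputs_a = a[item_a]
        step_b, inputs_b = b[item_b]
        if step_a != step_b:
            deltas.append((step_a, step_b))
        # Pair inputs positionally; this is where equal shape is assumed.
        stack.extend(zip(inputs_a, inputs_b))
    return deltas


if __name__ == "__main__":
    trace_a: Trace = {"x": ("S4", ["y"]), "y": ("S2", ["d1"]), "d1": ("S", ["d0"])}
    trace_b: Trace = {"x'": ("S4", ["y'"]), "y'": ("S2v2", ["d1'"]), "d1'": ("Sv2", ["d0"])}
    print(lockstep_diff(trace_a, trace_b, "x", "x'"))
```

Each reported pair corresponds to a node of a delta graph in this simplified setting, such as a version change from `S2` to `S2v2`.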
    • 22. Summary
        • Cloud e-genome has just begun (Sept. 2013)
            • Variant calling and interpretation for clinical use
            • Whole-exome sequence processing on a cloud infrastructure
            • Windows Azure: project sponsor
        • Currently porting existing pipelines to e-Science Central workflow
        • Provenance as a key element of supporting evidence
        • Scalability testing to start in Sept 2014