C4Bio paper talk


presented at C4Bio workshop, May 26th 2014, Chicago

  • Implement a cloud-based, secure, scalable computing infrastructure capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnoses for health care professionals.

    Azure: 10 L instances, 24 h a day / 30 TB/year / 10 GB of SQL Azure space / 30-100 TB

  • Coverage information translates into confidence in the variant call

    Quality score recalibration:
    the sequencer produces colour coding for the 4 nucleotides, along with a probability value for the most likely base call; these are the Q scores.
    Different platforms introduce different systematic biases in the Q scores, and the bias also varies by lane. The point of recalibration is to correct for this type of bias.
  • e-Science Central integrates multiple runtime environments
    - R, Octave, Java, Javascript, (Perl)
  • Traditional Variant Callers
    Go through the whole genome to identify locations where a number of non-reference bases appear, in order to call SNPs
    Gapped mapping to identify INDELs
    Different algorithms to calculate SNP and INDEL likelihoods

    GATK HaplotypeCaller
    Haplotype-based calling
    Call SNPs and indels simultaneously by performing a local de-novo assembly
    Same algorithm for SNP and INDEL likelihoods
    Artifacts caused by large INDELs recovered by assembly
  • We have seen some examples of the look and feel of e-SC.
    Now we briefly go over the architecture.
    SaaS – Science as a Service
  • The configuration is a trade-off: a shared resource has limited room for scalability (e.g. at peak times)
  • The execution model is currently synchronous
  • Overall Azure/lampredi ratio is about 1.4 (the Azure solution is 40% slower than lampredi). This is based on the pipeline run up to, but excluding, GATK phase 3, i.e. BWA-picard-GATK1-VariantCalling.
    BWA-picard-GATK1-VariantCalling takes over 90% of the total pipeline time, so the approximation should be OK.
    Our input in compressed form is 13.85 GiB on average.
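
The Q scores mentioned in the recalibration note above are conventionally Phred-scaled, i.e. Q = -10 · log10(p), where p is the estimated probability that the base call is wrong. The sketch below shows that relationship and how a FASTQ quality string maps to Q values. It is illustrative only: the function names are hypothetical, and the ASCII offset of 33 assumes the Sanger / Illumina 1.8+ encoding, which the talk does not specify.

```python
import math

# Assumption: Phred+33 (Sanger / Illumina 1.8+) FASTQ quality encoding.
PHRED_OFFSET = 33


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into an error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p into a Phred quality score Q."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = PHRED_OFFSET) -> list[int]:
    """Decode a FASTQ quality string into one integer Q score per base."""
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    for q in decode_quality_string("I5!"):
        print(q, phred_to_error_prob(q))
```

Recalibration, as described in the notes, adjusts the reported Q values so that they better match the empirically observed error rate for a given platform and lane; the conversion above stays the same.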
  • C4Bio paper talk

    1. From scripted HPC-based NGS pipelines to workflows on the cloud
       Jacek Cała, Yaobo Xu, Eldarina Azfar Wijaya, Paolo Missier
       School of Computing Science and Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne, UK
       C4Bio workshop @CCGrid 2014, Chicago, May 26th, 2014
    2. C4Bio2014@CCGrid,-P.Missier The Cloud-e-Genome project
       • NGS data processing: provide mechanisms to rapidly and flexibly create new exome sequence data processing pipelines, and to deploy them in a scalable way
       • Cost, Scalability, Flexibility, Data to insight
       • Human variant interpretation for clinical diagnosis: provide clinicians with a tool for analysis and interpretation of human variants
       • 2 year pilot project
       • Funded by UK’s National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC)
       • Nov. 2013: Cloud resources from Azure for Research Award
       • 1 year’s worth of data/network/computing resources
       Challenge: to deliver the benefits of WES/WGS technology to clinical practice
    3. Key technical goals
       • Scalability
         • In the rate and number of patient sequence submissions
         • In the density of sequence data (from whole exome to whole genome)
       • Flexibility, Traceability, Comparability across versions
         • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters)
         • Trace each version and its executions
         • Ability to compare results obtained using different pipelines and reason about the differences
       • Openness. Simplify the process of adding:
         • New variant analysis tools
         • New statistical methods for variant filtering, selection, and ranking
         • Integration with third party databases
    4. Approach and testbed
       Technical approach:
       • Double porting
         • Infrastructure: HPC cluster to cloud (IaaS)
         • Implementation: NGS pipelines from scripts to workflow
       • Implement user tools for clinical diagnosis as cloud apps (SaaS)
       Testbed and scale:
       • Neurological patients from the North-East of England, focus on rare diseases
       • Initial testing on about 300 sequences
       • 2500-3000 sequences expected within 12 months
    5. Why port to workflow?
       • Programming:
         • Workflows provide better abstraction in the specification of pipelines
         • Workflows directly executable by enactment engine
         • Easier to understand, share, and maintain over time
         • Flexible – relatively easy to introduce variations
       • System: minimal installation/deployment requirements
         • Fewer dedicated technical staff hours required
         • Automated dependency management, packaging, deployment
         • Extensible by wrapping new tools
         • Exploits available data parallelism (but not automagically)
       • Reproducibility
         • Execution monitoring, provenance collection
         • Persistent trace serves as evidence for data
         • Amenable to automated analysis
    6. Scripted pipeline
       • Recalibration: corrects for system bias on quality scores assigned by the sequencer
       • Computes coverage of each read
       • VCF subsetting by filtering, e.g. non-exomic variants
       • Annovar functional annotations (e.g. MAF, synonymity, SNPs…) followed by in-house annotations
       • Aligns sample sequence to HG19 reference genome using BWA aligner
       • Cleaning, duplicate elimination: Picard tools
       • Variant calling operates on multiple samples simultaneously. Splits samples into chunks. Haplotype caller detects both SNVs and longer indels
       • Variant recalibration attempts to reduce false positive rate from caller
    7. From scripts to workflows
    8. Workflow nesting
    9. Pipeline evolution
       Pipeline: set C = { c1 … cn } of components (tool wrappers). Each ci has a configuration conf(ci) and a version v(ci).
       What can change:
       1 – Tool version: v(ci) → v’(ci)
       2 – Tool replacement / add / remove: ci → c’i
       3 – Configuration parameters: conf(ci) → conf’(ci)
       …and why:
       • Technology / algorithm evolution
       • Traditional GATK variant caller → GATK haplotype caller
       • Does the interface change?
       • Do the operational assumptions change? E.g. GATK Variant Recalibrator requires large input data. Not suitable for targeted sequencing
       Just for sequence alignment, Pabinger et al. in their survey (*) list 17 aligners, while for variant annotation they refer to over 70 tools.
       (*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.” Briefings in bioinformatics, pp. bbs086–, Jan. 2013
    10. Role of provenance
        Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
        Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
        • Provenance is evidence in support of clinical diagnosis
          1. Why do these variants appear in the output list?
          2. Why have you concluded they are disease-causing?
        • Requires ability to trace variants through workflow execution
        • Simple scripting lacks this functionality
        “Where do these variants come from?” “Why do these results differ?”
    11. Comparing results across pipeline configurations
        • Run pipeline version V1: Variant list VL1
        • V1 → V2: replace BWA version, modify Annovar configuration parameters
        • Run pipeline version V2: Variant list VL2
        • Variant list VL1 ?? Variant list VL2: DDIFF (data differencing), PDIFF (provenance differencing)
        Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013): doi:10.1002/cpe.3035.
    12. PDIFF - overview [figure: workflows WA and WB]
    13. The corresponding provenance traces [figure: (i) Trace A, (ii) Trace B]
    14. Delta graph computed by PDIFF [figure: delta graph; labelled differences include S1, S5 (service replacement), S2, S2v2 (version change), S, Sv2 (version change), and the P0 and P1 branches of S2 and S4]
        PDIFF helps determine the impact of variations in the pipeline
    15. HPC Cluster configuration
        • 16 compute nodes
        • 48/96 GB RAM / 250 GB disk
        • 19 TB usable storage space
        • Gigabit Ethernet
        • Shared resource for Institute-wide research
        • Submission script specifies node / core requirements
        • Computation waits until resources are available
        Current config:
        • BWA alignment: 2 cores
        • GATK: 8 cores
    16. The case for cloud in genome informatics (*)
        • Storage + computing resources co-located in a cloud
        • Privacy issues
        • Public, private, or hybrid
        • Fluctuating demand → benefits from elasticity
        • Web-based access to clinicians simplifies adoption
        (*) Stein, Lincoln D. “The Case for Cloud Computing in Genome Informatics.” Genome Biology 11, no. 5 (January 2010): 207.
    17. Workflow on Azure Cloud - configuration
        [architecture diagram; labels: <<Azure VM>>, Azure Blob store, e-SC db backend, <<Azure VM>> e-Science Central main server, JMS queue, REST API, Web UI, web browser, rich client app, workflow invocations, e-SC control data, workflow data, <<worker role>> Workflow engine (x3), e-SC blob store, Workflow engines, Top level workflow, Sub-workflows]
        Test configuration: 3 nodes, 24 cores
    18. Workflow and sub-workflows execution
        [architecture diagram; labels: To e-SC queue (x3), Executable Block, e-SC db, <<Azure VM>> e-Science Central main server, JMS queue, REST API, Web UI, web browser, rich client app, workflow invocations, e-SC control data, workflow data, <<worker role>> Workflow engine (x3), e-SC blob store]
        Workflow invocation executing on one engine (fragment)
    19. Multi-sample processing
        • Sample list [S1…Sk]; Top level workflow; Variant files [VCF1…VCFk]
        • Map semantics: push K new workflow invocations to the e-SC queue
        [architecture diagram; labels: <<Azure VM>> e-Science Central main server, JMS queue, REST API, Web UI, web browser, rich client app, workflow invocations, e-SC control data, workflow data, <<worker role>> Workflow engine (x3), e-SC blob store, BWA (S1), BWA (S2), …]
    20. Sub-workflows enqueued recursively
        • Exec block: specifies threading → OS maps threads to available cores
        • Sub-workflow: gets added to queue
        • Sub-workflow: one instance gets added to queue for each input sample
    21. Preliminary cost estimates
        • 2 samples, 1 x 8 core, 30 hr @ £0.821 / h = £12.3 / sample
        • 6 samples, 3 x 8 core, 47 hr @ £2.5 / h = £19 / sample
        Cloud deployment makes cost easy to calculate
        Trade-off:
        • Better flexibility, scalability
        • But loss of performance
        Some tuning required. Cost model based on node uptime
        • Better node utilization: larger sample batches
        • Remove unnecessary wait time: make sub-workflows async
    22. Summary
        • Whole-exome sequence processing on a cloud infrastructure
        • Windows Azure – project sponsor
        • Tracking provenance as evidence and for change analysis
        • Porting HPC scripted pipeline to workflow model and technology
        • Scalability, Flexibility, Evolvability
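
The per-sample figures on the "Preliminary cost estimates" slide follow from a cost model based on node uptime: cost = nodes × hours × hourly node rate, divided by the number of samples in the batch. A minimal sketch of that arithmetic is given below. The class and field names are hypothetical, and it assumes the £2.5/h quoted for the three-node run is three times the single-node rate of £0.821/h, rounded up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One batch run, using the figures quoted on the cost slide."""
    samples: int          # samples processed in the batch
    nodes: int            # number of 8-core nodes kept up
    hours: float          # wall-clock uptime of the deployment
    node_rate_gbp: float  # assumed price per node-hour, in GBP

    def total_cost(self) -> float:
        # Cost model based on node uptime: every node is billed for every hour.
        return self.nodes * self.hours * self.node_rate_gbp

    def cost_per_sample(self) -> float:
        return self.total_cost() / self.samples


# Assumption: the same £0.821 node-hour rate applies to both runs.
small = Run(samples=2, nodes=1, hours=30, node_rate_gbp=0.821)
large = Run(samples=6, nodes=3, hours=47, node_rate_gbp=0.821)

if __name__ == "__main__":
    for run in (small, large):
        print(f"{run.samples} samples: £{run.cost_per_sample():.2f} per sample")
```

Under these assumptions the two runs come out at roughly £12.3 and £19.3 per sample, in line with the slide's rounded figures. Because billing follows uptime rather than useful work, larger batches and less idle waiting lower the per-sample cost, which is the tuning direction the slide suggests.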