Wolstencroft K - Workflows on the Cloud: scaling for national service


Presentation at BOSC2012 by Wolstencroft K - Workflows on the Cloud: scaling for national service

  1. Workflows on the Cloud: Scaling for National Service
     Katy Wolstencroft, Robert Haines, Helen Hulme, Mike Cornell, Shoaib Sufi, Andy Brass, Carole Goble (University of Manchester, UK)
     Madhu Donepudi, Nick James (Eagle Genomics Ltd, UK)
  2. Motivation: Workflows for Diagnostics
     • NHS genetic testing, e.g. for colon disease
     • Annotation of SNPs in patient data, ready for interpretation by a clinician
     • Diagnostic testing today:
         • Purify DNA and PCR the exons of relevant genes (MLH1, MSH2, MSH6)
         • Sequence, identify variants, and classify them (pathogenic, not pathogenic, unknown significance, etc.)
         • Write a report to the clinician
     • Diagnostic testing tomorrow (or later today) uses whole genome sequencing
         • Diagram labels: next-gen sequencing data; variation data; annotate, filter, display
     • New problem: how do we classify all the variants that we discover?
  3. Taverna Workflows
     • Sophisticated analysis pipelines
     • A set of services to analyse or manage data (either local or remote)
     • Workflows run through the workbench or via a server
     • Automation of data flow through services
     • Control of service invocation
     • Iteration over data sets
     • Provenance collection
     • Extensible and open source
  4. Taverna
     • Freely available, open source
     • Current version: 2.4
     • 80,000+ downloads across versions
     • Part of the myGrid Toolkit
     • Windows / Mac OS X / Linux / Unix
     • Reference: Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T. Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W729-32.
  5. SNP annotation
     • Annotation task:
         • Location, gene, transcript
         • Presence in public databases (dbSNP etc.)
         • Frequency in e.g. 1000 Genomes data
         • Conservation data (cross species)
     • Workflows are good for collecting and integrating data from a variety of sources into one place
  6. Variant Classification
     (Diagram: SNP consequence classes and their effects.)
     • Nonsense: base insertion, causing a frameshift
     • Synonymous
     • Missense: non-synonymous
     • Premature stop codon
     • Effects on function or splicing
     • Effects on splicing
  7. SNP Filtering / Triage
     • Which SNPs are the most important?
     • Reduction of 80K data points to those with potential clinical significance
     • Criteria:
         • Reduce to a (disease-)specific gene list
         • Rank by consequence: sense < missense < stop codon, etc.
         • Prediction tool scores
         • Frequency in the population, based on 1000 Genomes data etc. (high frequency implies non-deleterious)
         • Conservation across species (implies that a change is deleterious)
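The triage criteria on this slide can be sketched as a simple filter and ranking step. This is an illustration only, not the authors' workflow: the `Variant` fields, the `SEVERITY` values, the `triage` function, and the 1% frequency cut-off are all assumptions made for the example, and prediction tool scores are left out.

```python
from dataclasses import dataclass

# Hypothetical ranking that follows the slide's ordering
# (sense < missense < stop codon); the numeric values are arbitrary.
SEVERITY = {"synonymous": 0, "missense": 1, "stop_gained": 2}


@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str         # expected to be one of the SEVERITY keys
    population_freq: float   # allele frequency, e.g. from 1000 Genomes data
    conserved: bool          # position is conserved across species


def triage(variants, gene_list, max_freq=0.01):
    """Keep rare variants in the disease gene list, most severe first.

    max_freq is an assumed cut-off: a high population frequency is taken
    to imply that a variant is not deleterious.
    """
    kept = [
        v for v in variants
        if v.gene in gene_list and v.population_freq <= max_freq
    ]
    # Sort by consequence severity; conserved positions win ties.
    return sorted(
        kept,
        key=lambda v: (SEVERITY.get(v.consequence, -1), v.conserved),
        reverse=True,
    )
```

Calling `triage(variants, {"MLH1", "MSH2", "MSH6"})` would return only rare variants in those three genes, with stop-codon variants ahead of missense ones.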
  8. 8. Workflow Taverna’s “Tool Service” feature – used to wrap Perl scripts and other command line applications Uses VEP (Ensembl) Passes references to files
  9. Workflow Provenance
     • Record inferences in clinical decisions
     • What were the parameters used to build the dataset?
     • What versions of databases, genome assembly, and machine were used?
     • Where does each piece of evidence for or against pathogenicity originate from?
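One way to picture the questions on this slide is as a record stored alongside each result. The sketch below is an assumption for illustration, not Taverna's provenance model: `ProvenanceRecord`, `Evidence`, and every field name are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Evidence:
    claim: str        # e.g. "rare in population"
    supports: bool    # True = for pathogenicity, False = against
    source: str       # tool or database the evidence came from


@dataclass
class ProvenanceRecord:
    parameters: dict           # settings used to build the dataset
    database_versions: dict    # database or tool name -> version string
    genome_assembly: str
    machine: str               # sequencing machine that produced the calls
    evidence: list = field(default_factory=list)

    def add_evidence(self, claim, supports, source):
        self.evidence.append(Evidence(claim, supports, source))

    def to_json(self):
        # asdict recurses into nested dataclasses such as Evidence.
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Serialising the record to JSON keeps it readable by whoever later needs to check where a piece of evidence originated.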
  10. Infrastructure Requirements
      • Execute analysis workflows
      • Accessible to clinicians and genetic testers
      • Cope with expanding demands on compute
      • Provide a secure environment
      • Collect provenance
  11. Architecture overview
      (Diagram; recoverable labels only.)
      • All user interaction is via the web interface
      • User data is stored in the Cloud
      • Data for all tools and Web Services is stored in the Cloud
      • Components: Input SNPs, Results, Web interface, Storage (S3), Ensembl (MySQL), Cache (S3), Workflow engine orchestrator, e-Hive, several Taverna Server instances, application-specific tools and Web Services
      • Unified access to different workflow engines with a common REST API
      • Tools and Web Services for each workflow are installed together for easy replication
  12. Workflow engine orchestration
      • The orchestrator is workflow-executor agnostic
      • It uses a common API to:
          • List workflows
          • Configure runs
          • Start runs
          • Manage current runs (status, progress, delete runs)
      • Diagram labels: Workflow engine orchestrator; Common REST API; e-Hive Interface, Taverna Interface, Cache; Engine specific APIs; e-Hive, Taverna
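The engine-agnostic design on this slide can be sketched as one interface with an adapter per engine. Everything below is hypothetical: `EngineInterface`, `InMemoryEngine`, and `Orchestrator` are names invented for the example, and the code does not reflect the real Taverna Server or e-Hive APIs or the project's REST endpoints.

```python
import itertools
from abc import ABC, abstractmethod


class EngineInterface(ABC):
    """Operations the slide lists for the common API (hypothetical names)."""

    @abstractmethod
    def list_workflows(self): ...

    @abstractmethod
    def configure_run(self, workflow, inputs): ...

    @abstractmethod
    def start_run(self, run_id): ...

    @abstractmethod
    def status(self, run_id): ...

    @abstractmethod
    def delete_run(self, run_id): ...


class InMemoryEngine(EngineInterface):
    """Stand-in for an engine-specific adapter; it executes nothing."""

    def __init__(self, workflows):
        self._workflows = list(workflows)
        self._runs = {}
        self._ids = itertools.count(1)

    def list_workflows(self):
        return list(self._workflows)

    def configure_run(self, workflow, inputs):
        if workflow not in self._workflows:
            raise KeyError(workflow)
        run_id = next(self._ids)
        self._runs[run_id] = {"workflow": workflow, "inputs": inputs,
                              "status": "configured"}
        return run_id

    def start_run(self, run_id):
        self._runs[run_id]["status"] = "running"

    def status(self, run_id):
        return self._runs[run_id]["status"]

    def delete_run(self, run_id):
        del self._runs[run_id]


class Orchestrator:
    """Routes each call to the named engine; callers never see engine APIs."""

    def __init__(self, engines):
        self._engines = dict(engines)

    def list_workflows(self):
        return {name: e.list_workflows() for name, e in self._engines.items()}

    def configure_run(self, engine, workflow, inputs):
        # The returned handle records which engine owns the run.
        return engine, self._engines[engine].configure_run(workflow, inputs)

    def start_run(self, handle):
        engine, run_id = handle
        self._engines[engine].start_run(run_id)

    def status(self, handle):
        engine, run_id = handle
        return self._engines[engine].status(run_id)

    def delete_run(self, handle):
        engine, run_id = handle
        self._engines[engine].delete_run(run_id)
```

Because the web interface would only hold run handles, supporting another engine in this sketch means writing one more `EngineInterface` adapter and nothing else.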
  13. Additional Taverna Functionality
      • Integration with Cloud infrastructure
          • AWS first
      • Read/write files securely to S3
      • Start and stop Cloud instances if required
          • Tool and Web Service scaling
          • Self-scaling
      • Released as part of Taverna 3
  14. The user's view
      • Curated set of workflows
          • Designed, built and tested by domain experts
          • Quality assurance tested (if appropriate)
      • Workflows are presented as applications
          • The workflows themselves are hidden
          • Configured and run via a web interface
      • All user data stored securely in the Cloud
          • User separation
      • Workflows as a Service
  15. Web interface: Overview
      • Upload input data
      • Configure workflow runs with:
          • Input parameters
          • Uploaded data
          • Reused output data
      • Start workflow runs
      • Monitor workflow runs
      • View results preview
      • Download complete results
  16. Web interface: Getting started
  17. Web interface: Creating a Run
  18. Web interface: Checking run progress
  19. A Typical Workflow
      • Parse files from SNP calling machines
      • Annotate SNPs
      • Predict effects (BioMart, VEP, PolyPhen)
  20. Workflow as a Service
      • The workflow IS the service
      • Run restricted sets of Taverna workflows in the cloud
      • Connects to other cloud-based resources: storage, tools, etc.
      • Users can tweak parameters, but not design their own workflows
      • Web portal access for scientists
      • Data passed by reference instead of by file
      • Pay as you go: cheap at the point of use
      • Elastic and available now
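Passing data by reference means that workflow steps exchange short identifiers while the data itself stays in cloud storage. A minimal sketch, assuming an in-memory dictionary in place of S3: `ReferenceStore`, `annotate_step`, and the `ref://` scheme are invented for this example.

```python
class ReferenceStore:
    """In-memory stand-in for cloud object storage (hypothetical)."""

    PREFIX = "ref://"

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        """Store bytes under key and return a reference to them."""
        self._objects[key] = data
        return self.PREFIX + key

    def get(self, ref):
        """Resolve a reference back to the stored bytes."""
        return self._objects[ref[len(self.PREFIX):]]


def annotate_step(store, input_ref):
    """Read input by reference, write output, return only the new reference."""
    lines = store.get(input_ref).decode().splitlines()
    annotated = "\n".join(line + "\tannotated" for line in lines)
    return store.put("annotated.tsv", annotated.encode())
```

Only the reference string travels between steps, so a large input is never copied through the workflow engine itself.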
  21. Acknowledgements/Partners
      • University of Manchester
      • Eagle Genomics
      • Technology Strategy Board
          • 100932 - Cloud Analytics for Life Sciences
      • National Health Service
      • Amazon Web Services