Wolstencroft K - Workflows on the Cloud: scaling for national service

Presentation at BOSC2012 by Wolstencroft K - Workflows on the Cloud: scaling for national service

  • Diagnostics is increasingly using next-generation sequencing methods, which are replacing the sequencing of specific exons. The key difference is that the methods now usually look at a pre-decided set of genes and check for the presence of a set of "well-known" variants. The new method yields many thousands of SNPs, which must all be triaged. An example of where next-gen sequencing gives a benefit: hereditary blindness has more than 100 potential genes to look at, so next-gen sequencing is less costly than sequencing individual genes.
  • Carole's concept of "Workflows for Ensemble work".
  • What were the parameters used to build the dataset? What versions of databases, genome assembly, and machine were used? Where does each piece of evidence for/against pathogenicity originate?
  • OpenAM ( http://www.forgerock.com/openam.html ). Not sure the "AM" actually stands for anything specific now. It was called OpenSSO when Sun first created it (SSO means Single Sign-On). It is used for centralized authentication, authorization, entitlements and federation services, which for our purposes basically means user sign-on.
  • AWS == Amazon Web Services (i.e. the Amazon Cloud). S3 is Amazon's Simple Storage Service. Taverna 3 should arrive at the end of 2012.
  • Variant Effect Predictor (VEP); BioMart.

    1. Workflows on the Cloud: Scaling for National Service
        Katy Wolstencroft, Robert Haines, Helen Hulme, Mike Cornell, Shoaib Sufi, Andy Brass, Carole Goble (University of Manchester, UK)
        Madhu Donepudi, Nick James (Eagle Genomics Ltd, UK)
    2. Motivation: Workflows for Diagnostics
        • NHS genetic testing, e.g. colon disease
        • Annotation of SNPs in patient data, ready for interpretation by a clinician
        • Diagnostic testing today:
            • Purify DNA
            • PCR the exons of relevant genes (MLH1, MSH2, MSH6)
            • Sequence, identify variants, and classify them (pathogenic, not pathogenic, unknown significance, etc.)
            • Write a report to the clinician
        • Diagnostic testing tomorrow (or later today) uses whole-genome sequencing
        • [Diagram: next-gen sequencing data → variation data → annotate, filter, display]
        • New problem: how do we classify all the variants that we discover?
    3. Taverna Workflows
        • Sophisticated analysis pipelines
        • A set of services to analyse or manage data (either local or remote)
        • Workflows run through the workbench or via a server
        • Automation of data flow through services
        • Control of service invocation
        • Iteration over data sets
        • Provenance collection
        • Extensible and open source
    4. Taverna (http://www.taverna.org.uk/)
        • Freely available, open source
        • Current version: 2.4
        • 80,000+ downloads across versions
        • Part of the myGrid Toolkit
        • Windows / Mac OS X / Linux / Unix
        • Reference: Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T. Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W729-32.
    5. SNP annotation
        • Annotation task:
            • Location, gene, transcript
            • Presence in public databases (dbSNP etc.)
            • Frequency in e.g. 1000 Genomes data
            • Conservation data (cross-species)
        • Workflows are good for collecting and integrating data from a variety of sources into one place
    6. Variant Classification
        • SNP categories:
            • Synonymous: effects on splicing
            • Missense (non-synonymous): effects on function or splicing
            • Nonsense: base insertion, causing a frameshift
            • Nonsense: premature stop codon
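
The synonymous / missense / nonsense distinction on this slide can be illustrated with a minimal sketch that translates the reference and alternate codon using the standard genetic code. The function name `classify_snp` is hypothetical, and this is not how VEP or the workflow itself classifies variants; it ignores splicing effects, transcript context, and indels.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-base substitution within one codon (illustrative only)."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be three bases long")
    if sum(r != a for r, a in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("a SNP changes exactly one base")

    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid
    if alt_aa == "*":
        return "nonsense"     # premature stop codon
    # Simplification: a lost stop codon would also land here.
    return "missense"         # different amino acid
```

A real pipeline would take these calls from an annotation tool rather than recompute them, since the consequence depends on the transcript and reading frame.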
    7. SNP Filtering / Triage
        • Which SNPs are the most important?
        • Reduction of 80K data points to those with potential clinical significance
        • Criteria:
            • Reduce to a (disease-)specific gene list
            • Consequence ranking: sense < missense < stop codon, etc.
            • Prediction tool scores
            • Frequency in the population, based on 1000 Genomes data etc. (high frequency implies non-deleterious)
            • Conservation across species (implies that a change is deleterious)
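
The triage criteria above can be sketched as a filter followed by a ranking. Everything named here is an assumption for illustration: the `Variant` fields, the `triage` function, and the threshold values (`max_frequency`, `min_conservation`) are not taken from the talk, and prediction tool scores are left out for brevity.

```python
from dataclasses import dataclass

# Consequence ranking from the slide: sense < missense < stop codon.
SEVERITY = {"synonymous": 0, "missense": 1, "nonsense": 2}


@dataclass(frozen=True)
class Variant:
    variant_id: str
    gene: str
    consequence: str         # one of the SEVERITY keys
    allele_frequency: float  # population frequency, 0.0 to 1.0
    conservation: float      # hypothetical cross-species score, 0.0 to 1.0


def triage(variants, gene_list, max_frequency=0.01, min_conservation=0.5):
    """Keep rare, conserved variants in the gene list; most severe first."""
    kept = [
        v for v in variants
        if v.gene in gene_list
        and v.allele_frequency <= max_frequency   # common variants are assumed benign
        and v.conservation >= min_conservation    # conserved positions matter more
    ]
    # Rank by severity (descending), then by rarity (ascending frequency).
    return sorted(kept, key=lambda v: (-SEVERITY[v.consequence], v.allele_frequency))
```

For the colon-disease example, `gene_list` would be `{"MLH1", "MSH2", "MSH6"}`; the thresholds would need to be set by the domain experts who curate the workflow.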
    8. Workflow
        • Taverna's "Tool Service" feature is used to wrap Perl scripts and other command-line applications
        • Uses VEP (Ensembl)
        • Passes references to files
    9. Workflow Provenance
        • Record inferences in clinical decisions
        • What were the parameters used to build the dataset?
        • What versions of databases, genome assembly, and machine were used?
        • Where does each piece of evidence for/against pathogenicity originate?
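
The three provenance questions above map naturally onto a small record structure. The sketch below is one possible shape, not Taverna's provenance model; the class names, field names, and the JSON serialisation are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Evidence:
    source: str                   # tool or database that produced the evidence
    source_version: str
    supports_pathogenicity: bool  # True = evidence for, False = evidence against
    detail: str


@dataclass
class ProvenanceRecord:
    workflow: str
    parameters: dict          # parameters used to build the dataset
    genome_assembly: str
    database_versions: dict   # database name -> version string
    sequencing_machine: str
    evidence: list = field(default_factory=list)

    def add_evidence(self, source, source_version, supports_pathogenicity, detail):
        """Attach one piece of evidence together with its origin."""
        self.evidence.append(
            Evidence(source, source_version, supports_pathogenicity, detail)
        )

    def to_json(self) -> str:
        """Serialise the whole record so it can be stored beside the results."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Keeping the origin and version on every `Evidence` entry means a clinician reviewing a classification can trace each argument back to the tool release that produced it.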
    10. Infrastructure Requirements
        • Execute analysis workflows
        • Accessible to clinicians and genetic testers
        • Cope with expanding demands on compute
        • Provide a secure environment
        • Collect provenance
    11. Architecture overview
        • All user interaction via web interface
        • User data stored in the Cloud
        • Data for all tools and Web Services stored in the Cloud
        • [Diagram: input SNPs and results pass through the web interface; storage (S3); Ensembl (MySQL); cache (S3); workflow engine orchestrator; Taverna Server and e-Hive instances, each with application-specific tools and Web Services]
        • Unified access to different workflow engines with our common REST API
        • Tools and Web Services for each workflow are installed together for easy replication
    12. Workflow engine orchestration
        • Orchestrator is workflow-executor agnostic
        • Uses common API to:
            • List workflows
            • Configure runs
            • Start runs
            • Manage current runs: status, progress
            • Delete runs
        • [Diagram: workflow engine orchestrator → common REST API → e-Hive interface and Taverna interface (with cache) → engine-specific APIs → e-Hive and Taverna]
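
The executor-agnostic design on this slide can be sketched as one abstract interface that each engine adapter implements, with the orchestrator dispatching to whichever adapter owns a run. This is an in-process illustration only: the real system exposes these operations over a REST API, and the class and method names below (`EngineInterface`, `InMemoryEngine`, `Orchestrator`) are hypothetical, not the project's actual API.

```python
from abc import ABC, abstractmethod
from itertools import count


class EngineInterface(ABC):
    """Operations the common API offers, independent of the engine behind it."""

    @abstractmethod
    def list_workflows(self): ...

    @abstractmethod
    def configure_run(self, workflow, inputs): ...

    @abstractmethod
    def start_run(self, run_id): ...

    @abstractmethod
    def status(self, run_id): ...

    @abstractmethod
    def delete_run(self, run_id): ...


class InMemoryEngine(EngineInterface):
    """Stand-in adapter; a real one would call Taverna Server or e-Hive."""

    def __init__(self, workflows):
        self._workflows = list(workflows)
        self._runs = {}
        self._ids = count(1)

    def list_workflows(self):
        return list(self._workflows)

    def configure_run(self, workflow, inputs):
        if workflow not in self._workflows:
            raise KeyError(f"unknown workflow: {workflow}")
        run_id = next(self._ids)
        self._runs[run_id] = {"workflow": workflow, "inputs": dict(inputs),
                              "status": "configured"}
        return run_id

    def start_run(self, run_id):
        self._runs[run_id]["status"] = "running"

    def status(self, run_id):
        return self._runs[run_id]["status"]

    def delete_run(self, run_id):
        del self._runs[run_id]


class Orchestrator:
    """Routes each call to the named engine; a run handle is (engine, run_id)."""

    def __init__(self, engines):
        self._engines = dict(engines)

    def list_workflows(self):
        return {name: e.list_workflows() for name, e in self._engines.items()}

    def configure_run(self, engine, workflow, inputs):
        return engine, self._engines[engine].configure_run(workflow, inputs)

    def start_run(self, handle):
        self._engines[handle[0]].start_run(handle[1])

    def status(self, handle):
        return self._engines[handle[0]].status(handle[1])

    def delete_run(self, handle):
        self._engines[handle[0]].delete_run(handle[1])
```

Because callers only ever hold an opaque handle, adding another engine means writing one more adapter; the web interface code would not need to change.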
    13. Additional Taverna Functionality
        • Integration with Cloud infrastructure
            • AWS first
        • Read/write files securely to S3
        • Start and stop Cloud instances if required
            • Tool and Web Service scaling
            • Self-scaling
        • Released as part of Taverna 3
    14. The user's view
        • Curated set of workflows
            • Designed, built and tested by domain experts
            • Quality assurance tested (if appropriate)
        • Workflows are presented as applications
            • The workflows themselves are hidden
            • Configured and run via a web interface
        • All user data stored securely in the Cloud
            • User separation
        • Workflows as a Service
    15. Web interface: Overview
        • Upload input data
        • Configure workflow runs with:
            • Input parameters
            • Uploaded data
            • Reused output data
        • Start workflow runs
        • Monitor workflow runs
        • View results preview
        • Download complete results
    16. Web interface: Getting started
    17. Web interface: Creating a Run
    18. Web interface: Checking run progress
    19. A Typical Workflow
        • Parse files from SNP calling machines
        • Annotate SNPs
        • Predict effects (BioMart, VEP, PolyPhen)
    20. Workflow as a Service
        • The workflow IS the service
        • Run restricted sets of Taverna workflows in the cloud
        • Connects to other cloud-based resources (storage, tools, etc.)
        • Users can tweak parameters, but not design their own workflows
        • Web portal access for scientists
        • Data passed by reference instead of by file
        • Pay as you go: cheap at the point of use
        • Elastic and available now
    21. Acknowledgements/Partners
        • University of Manchester
        • Eagle Genomics
        • Technology Strategy Board
            • 100932 - Cloud Analytics for Life Sciences
        • National Health Service
        • Amazon Web Services