Wolstencroft K - Workflows on the Cloud: scaling for national service


Presentation at BOSC2012 by Wolstencroft K - Workflows on the Cloud: scaling for national service



Usage Rights

© All Rights Reserved

  • Diagnostics increasingly uses next-generation sequencing (NGS) methods, which are replacing the sequencing of specific exons. The key difference is that the methods now usually look at a pre-decided set of genes and check for the presence of a set of “well known” variants. The new method yields many thousands of SNPs, all of which must be triaged. An example of where NGS gives a benefit: hereditary blindness has more than 100 potential genes to look at, so NGS is less costly than sequencing individual genes.
  • Carole’s concept of “Workflows for Ensemble work”
  • What were the parameters used to build the dataset? What versions of databases, genome assembly and machine were used? Where does each piece of evidence for/against pathogenicity originate from?
  • OpenAM (http://www.forgerock.com/openam.html): not sure the “AM” actually stands for anything specific now. It was called OpenSSO when Sun first created it (SSO means Single Sign-On). It is used for centralized authentication, authorization, entitlements and federation services, which for our purposes basically means user sign-on.
  • AWS == Amazon Web Services (i.e. the Amazon Cloud). S3 is Amazon’s Simple Storage Service. Taverna 3 should arrive at the end of 2012.
  • Variant Effect Predictor, BioMart

Presentation Transcript

  • Workflows on the Cloud: Scaling for National Service. Katy Wolstencroft, Robert Haines, Helen Hulme, Mike Cornell, Shoaib Sufi, Andy Brass, Carole Goble (University of Manchester, UK); Madhu Donepudi, Nick James (Eagle Genomics Ltd, UK)
  • Motivation: Workflows for Diagnostics. NHS genetic testing, e.g. colon disease: annotation of SNPs in patient data, ready for interpretation by a clinician. Diagnostic testing today: purify DNA; PCR the exons of relevant genes (MLH1, MSH2, MSH6); sequence, identify variants and classify them (pathogenic, not pathogenic, unknown significance etc.); write a report to the clinician. Diagnostic testing tomorrow (or later today) uses whole-genome sequencing (diagram labels: next-gen sequencing data; variation data; annotate, filter, display). New problem: how do we classify all the variants that we discover?
  • Taverna Workflows Sophisticated analysis pipelines A set of services to analyse or manage data (either local or remote) Workflows run through the workbench or via a server Automation of data flow through services Control of service invocation Iteration over data sets Provenance collection Extensible and open source
  • Taverna (http://www.taverna.org.uk/): freely available and open source; current version 2.4; 80,000+ downloads across versions; part of the myGrid Toolkit; runs on Windows, Mac OS X and Linux/Unix. Reference: Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T. Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W729-32.
  • SNP annotation. Annotation task: location, gene, transcript; presence in public databases (dbSNP etc.); frequency in e.g. 1000 Genomes data; conservation data (cross species). Workflows are good for collecting and integrating data from a variety of sources into one place.
  • Variant Classification (diagram). SNP classes shown: nonsense (base insertion causing a frameshift; premature stop codon); synonymous (effects on splicing); missense, i.e. non-synonymous (effects on function or splicing).
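The codon-level part of this classification can be sketched in a few lines. This is a minimal illustration and not the workflow's actual code (the slides indicate that effect prediction is delegated to tools such as VEP). The function name `classify_substitution` and the extra `stop_lost` label are assumptions made for this example; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a substitution by comparing the encoded amino acids.

    Hypothetical helper: it looks only at the codon, so it cannot see
    splicing effects or frameshifts caused by insertions and deletions.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop stays stop)
    if alt_aa == "*":
        return "nonsense"     # premature stop codon
    if ref_aa == "*":
        return "stop_lost"    # not on the slide; added for completeness
    return "missense"         # different amino acid


print(classify_substitution("GAA", "GAG"))  # synonymous (Glu -> Glu)
print(classify_substitution("GAA", "GTA"))  # missense (Glu -> Val)
print(classify_substitution("TGG", "TGA"))  # nonsense (Trp -> stop)
```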
  • SNP Filtering / Triage: which SNPs are the most important? Reduction of 80K data points to those with potential clinical significance. Criteria: reduce to a (disease-)specific gene list; rank by consequence (sense < missense < stop codon etc.); prediction tool scores; frequency in the population, based on 1000 Genomes data etc. (high frequency implies non-deleterious); conservation across species (implies that a change is deleterious).
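The triage criteria on this slide could be combined roughly as follows. This is a hedged sketch only: the `Variant` fields, the `triage` function, the 1% frequency cut-off and the 0 to 1 conservation score are all assumptions for illustration, and prediction tool scores are left out for brevity. The gene names are the ones mentioned on the motivation slide.

```python
from dataclasses import dataclass

# Consequence ranking from the slide: sense < missense < stop codon.
SEVERITY = {"synonymous": 0, "missense": 1, "nonsense": 2}


@dataclass
class Variant:
    gene: str
    consequence: str          # one of the SEVERITY keys
    allele_frequency: float   # population frequency, 0..1 (assumed field)
    conservation: float       # cross-species score, 0..1 (assumed scale)


def triage(variants, gene_list, max_frequency=0.01):
    """Keep rare variants in the gene list, most severe first.

    max_frequency is an illustrative threshold, not one from the talk.
    """
    kept = [
        v for v in variants
        if v.gene in gene_list and v.allele_frequency <= max_frequency
    ]
    # Sort by severity, then by conservation, both descending.
    return sorted(
        kept,
        key=lambda v: (SEVERITY.get(v.consequence, 0), v.conservation),
        reverse=True,
    )


variants = [
    Variant("MLH1", "missense", 0.001, 0.9),
    Variant("MSH2", "nonsense", 0.0005, 0.5),
    Variant("BRCA1", "nonsense", 0.0001, 0.9),   # outside the gene list
    Variant("MSH6", "synonymous", 0.2, 0.1),     # too common
]
for v in triage(variants, {"MLH1", "MSH2", "MSH6"}):
    print(v.gene, v.consequence)
```

With these made-up inputs, only the MSH2 and MLH1 variants survive, and the stop-codon variant is ranked first.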
  • Workflow Taverna’s “Tool Service” feature – used to wrap Perl scripts and other command line applications Uses VEP (Ensembl) Passes references to files
  • Workflow Provenance: record inferences in clinical decisions. What were the parameters used to build the dataset? What versions of databases, genome assembly and machine were used? Where does each piece of evidence for/against pathogenicity originate from?
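One way to picture the answers to these provenance questions is as a structured record stored alongside each result. The sketch below is an assumption about what such a record might hold, not Taverna's provenance format; every class, field and value is a hypothetical placeholder.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Evidence:
    source: str                     # database or tool the evidence came from
    version: str                    # version of that source
    supports_pathogenicity: bool    # evidence for (True) or against (False)
    note: str = ""


@dataclass
class ProvenanceRecord:
    workflow: str
    parameters: dict                # parameters used to build the dataset
    genome_assembly: str
    database_versions: dict
    sequencing_machine: str
    evidence: list = field(default_factory=list)

    def to_json(self):
        # asdict() also converts the nested Evidence entries to dicts.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; none of these come from the presentation.
record = ProvenanceRecord(
    workflow="example-annotation-workflow",
    parameters={"max_frequency": 0.01},
    genome_assembly="EXAMPLE-ASSEMBLY",
    database_versions={"EXAMPLE-DB": "x.y"},
    sequencing_machine="EXAMPLE-MACHINE",
    evidence=[Evidence("EXAMPLE-DB", "x.y", True, "placeholder entry")],
)
print(record.to_json())
```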
  • Infrastructure Requirements: execute analysis workflows; be accessible to clinicians and genetic testers; cope with expanding demands on compute; provide a secure environment; collect provenance.
  • Architecture overview (diagram). All user interaction is via the web interface. User data is stored in the Cloud. Data for all tools and Web Services is stored in the Cloud. Components: input SNPs and results; web interface; storage (S3); Ensembl (MySQL); cache (S3); a workflow engine orchestrator in front of Taverna Server instances, e-Hive and other engines, each with application-specific tools and Web Services. Unified access to different workflow engines with our common REST API. Tools and Web Services for each workflow are installed together for easy replication.
  • Workflow engine orchestration: the orchestrator is workflow-executor agnostic. It uses a common API to: list workflows; configure runs; start runs; manage current runs; status; progress; delete runs. Diagram labels: workflow engine orchestrator; common REST API; e-Hive interface; Taverna interface; cache; engine-specific APIs; e-Hive; Taverna.
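The executor-agnostic design described on this slide can be illustrated with a small adapter pattern. This is a sketch under stated assumptions: the class and method names below are invented for the example and are not the project's REST API, and an in-memory stand-in replaces the real Taverna and e-Hive engines.

```python
import itertools
from abc import ABC, abstractmethod


class WorkflowEngine(ABC):
    """Hypothetical engine interface mirroring the operations on the slide."""

    @abstractmethod
    def list_workflows(self): ...

    @abstractmethod
    def configure_run(self, workflow, params): ...

    @abstractmethod
    def start_run(self, run_id): ...

    @abstractmethod
    def status(self, run_id): ...

    @abstractmethod
    def delete_run(self, run_id): ...


class InMemoryEngine(WorkflowEngine):
    """Stand-in for an engine-specific adapter (e.g. Taverna or e-Hive)."""

    def __init__(self, name, workflows):
        self.name = name
        self._workflows = list(workflows)
        self._runs = {}
        self._ids = itertools.count(1)

    def list_workflows(self):
        return list(self._workflows)

    def configure_run(self, workflow, params):
        if workflow not in self._workflows:
            raise KeyError(workflow)
        run_id = f"{self.name}-{next(self._ids)}"
        self._runs[run_id] = {"params": dict(params), "status": "configured"}
        return run_id

    def start_run(self, run_id):
        self._runs[run_id]["status"] = "running"

    def status(self, run_id):
        return self._runs[run_id]["status"]

    def delete_run(self, run_id):
        del self._runs[run_id]


class Orchestrator:
    """Routes the common operations to whichever engine owns the run."""

    def __init__(self, engines):
        self._engines = {engine.name: engine for engine in engines}

    def list_workflows(self):
        return {name: e.list_workflows() for name, e in self._engines.items()}

    def submit(self, engine_name, workflow, params):
        engine = self._engines[engine_name]
        run_id = engine.configure_run(workflow, params)
        engine.start_run(run_id)
        return run_id

    def status(self, engine_name, run_id):
        return self._engines[engine_name].status(run_id)

    def delete(self, engine_name, run_id):
        self._engines[engine_name].delete_run(run_id)


orchestrator = Orchestrator([
    InMemoryEngine("taverna", ["snp-annotation"]),
    InMemoryEngine("ehive", ["variant-effects"]),
])
run_id = orchestrator.submit("taverna", "snp-annotation", {"input": "s3-reference"})
print(run_id, orchestrator.status("taverna", run_id))
```

The point of the design is that callers such as the web interface only ever see the orchestrator's operations, so adding another engine means writing one more adapter rather than changing the callers.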
  • Additional Taverna Functionality. Integration with Cloud infrastructure: AWS first. Read/write files securely to S3. Start and stop Cloud instances if required: tool and Web Service scaling; self-scaling. Released as part of Taverna 3.
  • The user’s view. A curated set of workflows: designed, built and tested by domain experts; quality-assurance tested (if appropriate). Workflows are presented as applications: the workflows themselves are hidden; configured and run via a web interface. All user data is stored securely in the Cloud: user separation. Workflows as a Service.
  • Web interface: Overview. Upload input data; configure workflow runs with input parameters, uploaded data and reused output data; start workflow runs; monitor workflow runs; view results preview; download complete results.
  • Web interface: Getting started
  • Web interface: Creating a Run
  • Web interface: Checking run progress
  • A Typical Workflow: parse files from SNP-calling machines; annotate SNPs; predict effects (BioMart, VEP, PolyPhen).
  • Workflow as a Service: the workflow IS the service. Run restricted sets of Taverna workflows in the cloud; connects to other cloud-based resources (storage, tools etc.); users can tweak parameters, but not design their own workflows; web portal access for scientists; data passed by reference instead of by file; pay as you go (cheap at the point of use); elastic and available now.
  • Acknowledgements/Partners: University of Manchester; Eagle Genomics; Technology Strategy Board (100932 - Cloud Analytics for Life Sciences); National Health Service; Amazon Web Services.