Programming scientific data pipelines with the Taverna workflow management system
Dr. Paolo Missier, School of Computing Science, Newcastle University, UK
Internal seminar, Newcastle, February 2011
Speaker notes:
  • Workflow walk-through: run using 10 max hits and default inputs; open the processors and Beanshell boxes as a preview of what the workflow contains. Note: data dependencies only, no control dependencies; structural nesting / modularisation; the input is a simple query string; processors are either service operation invocations or shell-type scripts; execution is all local except the calls out to the services (the shell interpreters also run locally). Point out the areas of the workbench, zoom into any of the nested workflows, show intermediate values.
  • In scope: design (features available through the workbench); execution (local mode and server-based execution); BioCatalogue and myExperiment if time permits or on demand.
  • Taverna workflows are essentially programmable service orchestrations; Taverna as a data integration model.
  • Key observation: you can add your own services, but there is very little support for how to connect their ports -- no type system, for example.
  • The task becomes a processor when it is added to a workflow; the processor has one port for each operation.
  • Have an Rserve running locally; start it with: library(Rserve); Rserve(args="--no-save"). Start a new workflow and add an R script with this content: png(g); plot(rnorm(1:100)); dev.off(); -- or load R-simple-graphics.t2flow.
  • Load the example weather workflow in Taverna 2.2.0: example_workflow_for_rest_and_xpath_activities_650957.t2flow -- it won't work in earlier versions, as these activities are new plugins.
  • Show the BioAID plugin in Taverna 2.1.2.
  • Run workflow spreadsheed_data_import_example_492836.t2flow in Taverna 2.2.0.
  • Reload KEGG-genes-enzymes-atomicInput.t2flow, declare input geneID to be of depth 1, and input two genes: mmu:26416, mmu:328788.
  • Demo: show this workflow in action: generatedLargeList.t2flow, with I1 = 10 and list size = 10.
  • On a typical QTL region of ~150k base pairs, one execution of this workflow finds about 50 Ensembl genes. These may correspond to about 30 genes in the UniProt database and 60 in the NCBI Entrez Gene database. Each gene may be involved in a number of pathways; for example, the mouse genes Mapk13 (mmu:26415) and Cdkn1a (mmu:12575) participate in 13 and 9 pathways, respectively.
  • http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/#News_Aggregator_Scenario

    1. 1. Programming scientific data pipelines with the Taverna workflow management system. Dr. Paolo Missier, School of Computing Science, Newcastle University, UK. Newcastle, Feb. 2011. With thanks to the myGrid team in Manchester for contributing their material and time.
    2. 2. Outline. Objective: to provide a practical introduction to workflow systems for scientific applications.
  • Workflows in context: lifecycle and the workflow eco-system
  • Workflows for data integration
  • The user experience: from services and scripts to workflows
  • Extensibility I: importing services – using R scripts
  • Extensibility II: plugins
  • The Taverna computation model: a closer look – functional model, dataflow parallelism
  • Performance
    3. 3. Workflows in science. High-level programming models for scientific applications:
  • Specification of service / component execution orchestration
  • Handle cross-cutting concerns such as error handling, service invocation, data movement, data streaming, provenance tracking, ...
  • A workflow is a specification, configured for each run
    4. 4. What are workflows used for? Earth Sciences, Life Sciences.
    5. 5. Taverna
  • First released 2004; current version Taverna 2.2
  • Currently 1500+ users per month, 350+ organizations, ~40 countries, 80,000+ downloads across versions
  • Freely available, open source (LGPL); runs on Windows, Mac OS, and Linux
  • http://www.taverna.org.uk
  • User and developer workshops, documentation, public mailing list and direct email support
  • http://www.taverna.org.uk/introduction/taverna-in-use/
    6. 6. Who else is in this space? Trident, Triana, VisTrails, Kepler, Taverna, Pegasus (ISI).
    8. 8. Example: the BioAID workflow. Purpose: the workflow extracts protein names from documents retrieved from MedLine based on a user query (cf. Apache Lucene syntax). The protein names are filtered by checking whether a valid UniProt ID exists for the given protein name. Credits: Marco Roos (workflow); text mining services by Sophia Katrenko and Edgar Meij (AID), and Martijn Schuemie (BioSemantics, Erasmus University Rotterdam). Available from myExperiment: http://www.myexperiment.org/workflows/154.html
    9. 9. The workflows eco-system in myGrid: a process-centric science lifecycle, with service discovery and import, and with data (inputs, parameters), metadata (provenance, annotations) and methods (the workflow, results) flowing through it.
    13. 13. Workflow as data integrator: QTL genomic regions → genes in QTL → metabolic pathways (KEGG).
    16. 16. Taverna computational model (very briefly). Values are list-structured: e.g. the KEGG gene IDs [ [ mmu:26416 ], [ mmu:328788 ] ] map to pathway lists such as [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ] per gene, and overall to the nested list [ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ...] ].
  • Collection processing
  • Simple type system: no record / tuple structure
  • Data-driven computation, with optional processor synchronisation
  • Parallel processor activation: greedy (no scheduler)
    18. 18. From services and scripts to workflows. The BioAID workflow again: http://www.myexperiment.org/workflows/154.html. Overall composition: 15 Beanshell and other local scripts (mostly for data formatting) and 4 WSDL-based service operations:
  • getUniprotID (service: synsetServer)
  • queryToArray (service: tokenize)
  • apply (service: applyCRFService)
  • search (service: SearcherWSService)
    19. 19. Service composition requires adapters. Example: SBML model optimisation workflow, designed by Peter Li: http://www.myexperiment.org/workflows/1201. Two of the adapter scripts (alongside a built-in "URL -> content" shell script):

    String[] lines = inStr.split("\n");
    StringBuffer sb = new StringBuffer();
    for (i = 1; i < lines.length - 1; i++) {
        String str = lines[i];
        str = str.replaceAll("<result>", "");
        str = str.replaceAll("</result>", "");
        sb.append(str.trim() + "\n");
    }
    String outStr = sb.toString();

    import java.util.regex.Pattern;
    import java.util.regex.Matcher;
    sb = new StringBuffer();
    p = "CHEBI:[0-9]+";
    Pattern pattern = Pattern.compile(p);
    Matcher matcher = pattern.matcher(sbrml);
    while (matcher.find()) {
        sb.append("urn:miriam:obo.chebi:" + matcher.group() + ",");
    }
    String out = sb.toString();
    // Clean up
    if (out.endsWith(",")) out = out.substring(0, out.length() - 1);
    chebiIds = out.split(",");
    22. 22. Building workflows from existing services
  • Large collection of available services: a default but extensible palette of services in the workbench
  • Mostly third party; all the major providers (NCBI, DDBJ, EBI, ...) and a plethora of others
  • For an example of how to build a simple workflow, please follow Exercise 3 from this tutorial
    23. 23. Incorporating R scripts into Taverna. Requirements for using R in a local installation:
  • install R from the main archive site: http://cran.r-project.org/
  • install Rserve: http://www.rforge.net/Rserve/
  • start Rserve locally: start the R console and type the commands library(Rserve) and Rserve(args="--no-save")
Taverna can display graphical output from R. The following R script simply produces a png image that is displayed on the Taverna output:

    png(g);
    plot(rnorm(1:100));
    dev.off();

To use it, create an R Taverna workflow with output port g, of type png image. See also: http://www.mygrid.org.uk/usermanual1.7/rshell_processor.html
    24. 24. Integration between Taverna and eScience Central: an example of integration between Taverna workflows (desktop) and the eScience Central cloud environment, facilitated by Taverna's plugin architecture. See http://www.cs.man.ac.uk/~pmissier/T-eSC-integration.svg
    25. 25. Plugin: Excel spreadsheets as workflow input
  • Third-party plugin code can later be bundled in a distribution
  • Example: importing input data from a spreadsheet -- see http://www.myexperiment.org/workflows/1417.html and the example input spreadsheet http://www.myexperiment.org/files/410.html
    28. 28. Taverna model of computation: a closer look
  • Arcs between two ports define data dependencies
    – processors with inputs on all their (connected) ports are ready
    – no active scheduling: admission control is simply by the size of the thread pool
    – processors fire as soon as they are ready and there are available threads in the pool
  • No control structures: no explicit branching or loop constructs
    – but dependencies between processors can be added: end(P1) ➔ begin(P2)
    – coordination link semantics: "fetch_annotations can only start after ImprintOutputAnnotator has completed"
    – typical pattern: writer ➔ reader (e.g. to an external DB)
A sketch of this firing rule follows below.
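To make the firing rule concrete, here is a minimal illustrative sketch in plain Java. It is not Taverna's engine code or API, and all names (SketchProcessor, deliver, fire) are hypothetical: a processor fires as soon as every connected input port has received a value, and concurrency is limited only by the size of a fixed thread pool.

    // Illustrative sketch only, not Taverna's engine: a data-driven firing rule.
    import java.util.*;
    import java.util.concurrent.*;
    import java.util.function.Function;

    class SketchProcessor {
        static final ExecutorService POOL = Executors.newFixedThreadPool(4); // admission control by pool size

        final String name;
        final Set<String> inputPorts;                          // connected input ports
        final Map<String, Object> received = new ConcurrentHashMap<>();
        final Function<Map<String, Object>, Object> activity;  // the wrapped service call or script
        final List<Map.Entry<SketchProcessor, String>> links = new ArrayList<>(); // output -> (processor, port)

        SketchProcessor(String name, Set<String> inputPorts,
                        Function<Map<String, Object>, Object> activity) {
            this.name = name; this.inputPorts = inputPorts; this.activity = activity;
        }

        void connectTo(SketchProcessor p, String port) { links.add(Map.entry(p, port)); }

        // Called when a value arrives on one of this processor's input ports.
        synchronized void deliver(String port, Object value) {
            received.put(port, value);
            if (received.keySet().containsAll(inputPorts)) {   // ready: all connected ports filled
                POOL.submit(this::fire);                       // fires greedily, bounded by the pool
            }
        }

        private void fire() {
            Object out = activity.apply(received);
            for (var link : links) link.getKey().deliver(link.getValue(), out); // push downstream
        }

        public static void main(String[] args) throws InterruptedException {
            SketchProcessor print = new SketchProcessor("print", Set.of("in"),
                    m -> { System.out.println(m.get("in")); return null; });
            SketchProcessor upper = new SketchProcessor("upper", Set.of("x"),
                    m -> m.get("x").toString().toUpperCase());
            upper.connectTo(print, "in");
            upper.deliver("x", "ready to fire");   // "upper" fires, then pushes its output to "print"
            Thread.sleep(500);                     // give the pool time to run both activations (sketch only)
            POOL.shutdown();
        }
    }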
    29. 29. List processing model. Consider the gene-enzymes workflow from the previous demo. Values can be either atomic or (nested) lists:
  • values are of simple types (string, number, ...)
  • but also MIME types for images (see the R example above)
What happens if the input to our workflow is a list of gene IDs, e.g. geneID = [ mmu:26416, mmu:19094 ]? We need to declare the input geneID to be of depth 1 (depth n in general, for a generic n-deep list).
    30. 30. Implicit iteration over lists. Demo: reload KEGG-genes-enzymes-atomicInput.t2flow, declare input geneID to be of depth 1, input two genes, run.
  • Each processor is activated once for each element in the list, because each is designed to accept an atomic value
  • The result is a nested list of results, one for each gene in the input list
(A small sketch of what this iteration computes follows below.)
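The following is a minimal sketch, in plain Java, of what the implicit iteration computes; lookupPathways is a hypothetical stand-in for a processor that expects a single atomic gene ID, not a real KEGG client.

    // Minimal sketch (hypothetical names, not a Taverna API): implicit iteration.
    // A processor written for an atomic input, applied to a depth-1 list, is invoked
    // once per element, and the result is a nested list mirroring the input.
    import java.util.List;
    import java.util.stream.Collectors;

    public class ImplicitIteration {
        // Pretend "lookupPathways" is a processor expecting a single gene ID.
        static List<String> lookupPathways(String geneId) {
            return List.of("path:mmu04010 (" + geneId + ")", "path:mmu04370 (" + geneId + ")"); // stand-in result
        }

        public static void main(String[] args) {
            List<String> geneIds = List.of("mmu:26416", "mmu:19094");   // input declared with depth 1

            // The engine iterates implicitly: one activation per list element.
            List<List<String>> pathwaysPerGene = geneIds.stream()
                    .map(ImplicitIteration::lookupPathways)
                    .collect(Collectors.toList());

            System.out.println(pathwaysPerGene);  // a nested list: one result list per input gene
        }
    }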
    31. 31. Functional model for collection processing /1
  • Simple processing: the service expects atomic values and receives atomic values; P maps each input directly to an output.
  • Simple iteration: the service expects atomic values but receives an input list v = [v1 ... vn]; P is applied once per element, yielding w = [w1 ... wn]. Here the declared port depth is dd = 0, the actual depth of the value is ad = 1, and the depth mismatch is δ = 1.
  • Extension: the service expects atomic values but receives a nested input list v = [[...], ... [...]] (ad = 2, dd = 0, δ = 2); the output w = [[...] ... [...]] is a nested list of the same shape.
    35. 35. Functional model /2. The simple iteration model generalises by induction to a generic depth mismatch: for a value v = [[...], ... [...]] of actual depth ad = n arriving at a port of declared depth dd = m, δ = n - m ≥ 0, and the output w = [[...] ... [...]] has depth n - m. This leads to a recursive functional formulation for simple collection processing over a list v = [a1 ... an]:

    (eval_l P v) = (P v)                     if l = 0
                 = (map (eval_{l-1} P) v)    if l > 0
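A direct transcription of this recursive formulation, as a sketch only (the eval helper is hypothetical, not a Taverna API): at depth mismatch l = 0 the processor is applied directly, otherwise eval is mapped over the list elements.

    // Minimal sketch of the recursive eval/map formulation above (hypothetical helper).
    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class EvalSketch {
        @SuppressWarnings("unchecked")
        static Object eval(int l, Function<Object, Object> p, Object v) {
            if (l == 0) {
                return p.apply(v);                        // (eval_0 P v) = (P v)
            }
            return ((List<Object>) v).stream()            // (eval_l P v) = (map (eval_{l-1} P) v)
                    .map(x -> eval(l - 1, p, x))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            Function<Object, Object> p = x -> "enzymes(" + x + ")";                  // stand-in atomic processor
            Object nested = List.of(List.of("mmu:26416"), List.of("mmu:328788"));    // depth-2 input
            System.out.println(eval(2, p, nested));       // the nesting of the output mirrors the input
        }
    }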
    37. 37. Functional model - multiple inputs /3. Processor P has input ports X1, X2, X3 receiving v1 = [v11 ... v1n], v2 = [v21 ... v2k], v3 = [v31 ... v3m]:
  • dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
  • dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
  • dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
The output is w = [ [w11 ... w1n], ... [wm1 ... wmn] ]. The cross-product involves v1 and v3 (but not v2): v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ], and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ].
    39. 39. Generalised cross product. Binary product, δ = 1:

    a × b = [ [ (ai, bj) | bj ← b ] | ai ← a ]
    (eval_2 P <a, b>) = (map (eval_1 P) (a × b))

Generalised to arbitrary depths:

    (v, d1) ⊗ (w, d2) = [ [ (vi, wj) | wj ← w ] | vi ← v ]   if d1 > 0, d2 > 0
                        [ (vi, w) | vi ← v ]                 if d1 > 0, d2 = 0
                        [ (v, wj) | wj ← w ]                 if d1 = 0, d2 > 0
                        (v, w)                               if d1 = 0, d2 = 0

...and to n operands: ⊗_{i:1..n} (vi, di). Finally, the general functional semantics for collection-based processing:

    (eval_l P <(v1, d1), ..., (vn, dn)>) = (P <v1, ..., vn>)                           if l = 0
                                         = (map (eval_{l-1} P) (⊗_{i:1..n} (vi, di)))  if l > 0
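The binary case for two depth-1 lists can be sketched as follows; crossApply is a hypothetical helper, not part of Taverna, and the example values are made up.

    // Minimal sketch of the binary cross product above for two depth-1 lists:
    // every element of a is paired with every element of b, and the processor P
    // is applied once per pair, giving one inner result list per element of a.
    import java.util.List;
    import java.util.function.BiFunction;
    import java.util.stream.Collectors;

    public class CrossProductSketch {
        static <A, B, R> List<List<R>> crossApply(List<A> a, List<B> b, BiFunction<A, B, R> p) {
            return a.stream()
                    .map(ai -> b.stream()
                            .map(bj -> p.apply(ai, bj))    // (eval_1 P) on each pair (ai, bj)
                            .collect(Collectors.toList()))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<String> genes = List.of("mmu:26416", "mmu:328788");
            List<String> organisms = List.of("mouse", "human");
            // Stand-in processor combining one value from each input port.
            System.out.println(crossApply(genes, organisms, (g, o) -> g + "@" + o));
            // [[mmu:26416@mouse, mmu:26416@human], [mmu:328788@mouse, mmu:328788@human]]
        }
    }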
    42. 42. Parallelism in the dataflow model. The data-driven model with implicit iterations provides opportunities for parallel processing of workflows. Two types of parallelism:
  • intra-processor: implicit iteration over list data, with an implicit assumption of independence amongst the threads that operate on elements of a list
  • inter-processor: pipelining -- as the elements of [ id1, id2, id3, ...] flow through the chain SFH ➔ getDS, downstream processors can start on early elements while upstream ones are still producing [ DS1, DS2, DS3, ...]
(A small pipelining sketch follows below.)
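A minimal pipelining sketch, assuming two made-up stages (fetchHomologue and getDataset stand in for the SFH and getDS processors in the figure); this is illustrative plain Java, not how the Taverna engine is implemented.

    // Minimal pipelining sketch (hypothetical, not Taverna code): two chained stages.
    // Each list element flows to the second stage as soon as the first stage has
    // produced it, instead of waiting for the whole list.
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.stream.Collectors;

    public class PipeliningSketch {
        static String fetchHomologue(String id) { return "SFH(" + id + ")"; }  // stand-in stage 1
        static String getDataset(String h)      { return "DS(" + h + ")"; }    // stand-in stage 2

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<String> ids = List.of("id1", "id2", "id3");

            // One pipeline per element: stage 2 is chained onto stage 1 element by element.
            List<CompletableFuture<String>> pipelines = ids.stream()
                    .map(id -> CompletableFuture.supplyAsync(() -> fetchHomologue(id), pool)
                                                .thenApplyAsync(PipeliningSketch::getDataset, pool))
                    .collect(Collectors.toList());

            for (CompletableFuture<String> f : pipelines) System.out.println(f.get());
            pool.shutdown();
        }
    }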
    43. 43. Exploiting latent parallelism. Given an input list [ a, b, c, ... ], the first stage produces [ (echo_1 a), (echo_1 b), (echo_1 c), ... ] and the second stage produces (echo_2 (echo_1 a)), (echo_2 (echo_1 b)), (echo_2 (echo_1 c)), ..., with the two stages overlapping. See also: http://www.myexperiment.org/workflows/1372.html
    44. 44. Performance - experimental setup
  • The previous version of the Taverna engine is used as the baseline; the objective is to measure the incremental improvement
  • Workload: a list generator feeding multiple parallel pipelines
  • Parameters: byte size of list elements (strings), size of the input list, length of the linear chain
  • Main insight: when the workflow is designed for pipelining, parallelism is exploited effectively
    45. 45. Performance study: experimental setup - I
  • Programmatically generated dataflows -- the "T-towers"
  • Parameters: size of the lists involved, length of the paths; includes one cross product
    46. 46. caGrid workflow for performance analysis. Goal: perform cancer diagnosis using microarray analysis -- learn a model for lymphoma-type prediction based on samples from different lymphoma types. Steps: lymphoma samples ➔ hybridization data ➔ process microarray data as training dataset ➔ learn predictive model. Source: caGrid, http://www.myexperiment.org/workflows/746
    51. 51. Results I - Memory usage. Compared configurations: T2 with main-memory data management, T2 with the embedded Derby back-end, and the T1 baseline; T2 shows shorter execution time due to pipelining. List size: 1,000 strings of 10K chars each; no intra-processor parallelism (1 thread/processor).
    53. 53. Results II - Available processors pool: pipelining in T2 makes up for smaller pools of threads per processor.
    54. 54. Results III - Bounded main memory usage: separation of data and process spaces ensures scalable data management; data element size varied over 10K, 25K, 100K chars.
    55. 55. Ongoing effort: Taverna on the cloud
  • Early experiments on running multiple instances of Taverna workflows in a cloud environment
  • Coarse-grained cloud deployment: workflow-at-a-time -- data partitioning ➔ each partition is allocated to a workflow instance
  • For more details, please see: Paul Fisher, ECCB talk slides, October 2010
    56. 56. Summary
  • Workflows: a high-level programming paradigm that bridges the gap between scientists and developers
  • Many workflow models available (commercial and open source)
  • Taverna implements a dataflow model that has proven useful for a broad variety of scientific applications
Strengths:
  • Rapid prototyping given a base of third-party or own services
  • Explicit modelling of data integration processes
  • Extensibility: workflow designers can easily import third-party services (SOAP, REST) and use scripts in a variety of languages; developers can easily add functionality using a plugin model
  • Good potential for parallelisation
  • Early experiments on cloud deployment (workflow-at-a-time), with an ongoing study of finer-grained deployment of portions of the workflow
    57. 57. ADDITIONAL MATERIAL: Provenance of workflow data; Provenance and trust of Web data.
    58. 58. Example workflow (Taverna). Input QTL region (chr: 17, start: 28500000, end: 3000000) ➔ Ensembl genes ➔ (Ensembl gene → UniProt gene, Ensembl gene → Entrez gene) ➔ (UniProt gene → KEGG gene, Entrez gene → KEGG gene) ➔ merge gene IDs ➔ gene → pathway, e.g. path:mmu04210 Apoptosis, path:mmu04010 MAPK, ...
    59. 59. Baseline provenance of a workflow run. A run of the previous workflow yields a provenance graph over the data it consumed and produced, e.g. path:mmu04010 → derives_from → mmu:26416 and path:mmu04012 → derives_from → mmu:12575.
  • The graph encodes all direct data dependency relations
  • Baseline query model: compute paths amongst sets of nodes -- a transitive closure over data dependency relations
(A small reachability sketch follows below.)
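A small sketch of the baseline query model, assuming a toy in-memory edge map (not the actual provenance store or its API): the ancestors of a value are computed by a breadth-first traversal of the direct derives_from edges, i.e. a transitive closure on demand.

    // Minimal sketch (hypothetical, not the myGrid provenance API): reachability
    // over direct derives_from edges, computed with a breadth-first search.
    import java.util.*;

    public class ProvenanceReachability {
        // direct dependencies: value -> values it directly derives from
        static final Map<String, List<String>> DERIVES_FROM = Map.of(
                "path:mmu04010", List.of("mmu:26416"),
                "path:mmu04012", List.of("mmu:12575"),
                "mmu:26416",     List.of("ENSMUSG_example")   // hypothetical upstream Ensembl gene
        );

        static Set<String> ancestors(String value) {
            Set<String> seen = new LinkedHashSet<>();
            Deque<String> frontier = new ArrayDeque<>(DERIVES_FROM.getOrDefault(value, List.of()));
            while (!frontier.isEmpty()) {
                String v = frontier.poll();
                if (seen.add(v)) frontier.addAll(DERIVES_FROM.getOrDefault(v, List.of()));
            }
            return seen;  // everything the value transitively derives from
        }

        public static void main(String[] args) {
            System.out.println(ancestors("path:mmu04010")); // [mmu:26416, ENSMUSG_example]
        }
    }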
    60. 60. Motivation for fine-grained provenance. With list-structured values -- e.g. the KEGG gene IDs [ [ mmu:26416 ], [ mmu:328788 ] ] mapping to the nested pathway lists [ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ...] ] -- we want to relate individual input elements to the individual output elements they contribute to, not just whole collections.
    65. 65. Efficient query processing: main result. The workflow graph is used alongside the provenance graph:
  • Query the provenance of individual collection elements
  • But avoid computing transitive closures on the provenance graph; use the workflow graph as an index instead
  • Exploit workflow model semantics to statically predict dependencies on individual tree elements
  • This results in substantial performance improvement for typical queries
    68. 68. Trust and provenance for Web data. Testimonials (http://www.w3.org/2005/Incubator/prov/):
  • "At the toolbar (menu, whatever) associated with a document there is a button marked "Oh, yeah?". You press it when you lose that feeling of trust." - Tim Berners-Lee, Web Design Issues, September 1997
  • "Provenance is the number one issue we face when publishing government data as linked data for data.gov.uk" - John Sheridan, UK National Archives, data.gov.uk, February 2010
But how exactly is provenance-based quality checking going to work? Upcoming W3C Working Group on Provenance for Web data -- a European initiative, chaired by Luc Moreau (Southampton) and Paul Groth (NL).
    69. 69. Provenance graphs and belief networks. Intuition: as news propagates, so do trust and quality judgments about it.
  • Is there a principled way to model this?
  • Idea: explore conceptual similarities between provenance graphs and belief networks (i.e. Bayesian networks); the slide shows a standard Bayesian network example.
    70. 70. From process graph to provenance graph. A data production and publishing process (author A1 and processes P1, P2 producing data d1, d2; services S1, S2 publishing them as d1', d2'; curator C1; final data dx), with quality control points along the way, maps to a provenance graph for dx using relations such as "used", "was generated by", "published" and "was published by".
    73. 73. From provenance graph to belief network. The provenance graph for dx (with its "used", "was generated by", "published" and "was published by" relations over d1, d1', d2, d2', P1, P2, S1, S2, A1, C1) is mapped to a belief network in which nodes carry conditional probability tables (CPTs).
  • Assume quality judgments are available at the quality control points (QCPs)
  • Where do the remaining conditional probabilities come from?
  • Can judgments be propagated there?
