2011-06-08 Taverna workflow system

Taverna workflow system - presented by Stian Soiland-Reyes at ITER Integrated Modelling workshop in Cadarache, France on 2011-06-08.


2011-06-08 Taverna workflow system

  1. http://taverna.org.uk/
     Stian Soiland-Reyes & Robert Haines
     myGrid, School of Computer Science, University of Manchester, UK
     ITER IM workshop, Château de Cadarache, 2011-06-08
  2. What is myGrid?
     • An e-Science collaboration since 2001
     • Not a grid!
     • Numerous partners involved:
       – University of Manchester
       – University of Southampton
       – University of Oxford
       – EMBL-EBI
     • Provides sustainable and production-quality software
       – Supported by OMII-UK, EPSRC and BBSRC
     • Mixture of developers, bioinformaticians and researchers
     Software | Services | Content | Skills | Community
     http://www.mygrid.org.uk/ http://www.taverna.org.uk/
  3. Motivation: Bioinformatics
     • Challenge:
       – Large amounts of data
       – Many open questions
       – Numerous freely available public datasets and analysis tools
  4. Huge amounts of data
     • Microarray: 1000+ genes
     • QTL regions: 100+ genes
     • Next Gen Sequencing: 100,000+ genes
     How do I look at all the genes systematically?
  5. Manual approach
     • Search using public web sites and databases
       – Pubmed
       – Uniprot
       – EBI BioMart
     • Copy and paste to web tools for analysis
       – NCBI Blast
       – EBI InterPro
     • Further processing locally
       – R
       – Perl
       – Python
  6. Manual: disadvantages
     • Scale of analysis task overwhelms researchers: lots of data
     • User bias and premature filtering of datasets: cherry picking
     • Hypothesis-driven approach to data analysis
     • Constant changes in data: problems with re-analysis of data
     • Implicit methodologies (hyper-linking through web pages)
     • Error proliferation from any of the listed issues, notably human error
  7. Web services and workflows
     • Web services
       – Technology and standards for exposing code and data resources that can be programmatically consumed by a remote third party
       – Description on how to interact with the service, parameters, documentation
     • Workflows
       – General technique for describing and executing a process
       – Describe what you want to do running which services
  8. The Taverna Open Source Suite of Tools
     [Architecture diagram; labels as extracted: Web Portals, Workflow Repository, GUI Workbench, Client User Interfaces, Virtual Machine, Third Party Tools, Service Catalogue, Workflow Engine, Provenance, Workflow Store, Server, Activity and Service Plug-in Manager, Open Provenance Model, Programming and Secure Service Access APIs]
  9. Taverna workflows
     • A set of (local and remote) services to analyze or manage data
     • Nested workflows are also services
     • Data links connect services
       – i.e. output from service A is input to service B and C
     • Describes the desired dataflow instead of process coordination
     • Automatic iterations
     • Can customize list handling and control links
     [Figure: example workflow diagram with processors such as genes_in_qtl, Kegg_gene_ids and Get_pathways. Workflow inputs include start_position, chromosome_name and end_position. Workflow outputs: gene_descriptions, genes_pathways, merged_pathways, pathway_descriptions, pathway_ids, kegg_external_gene_reference, report, ensembl_database_release, kegg_pathway_release]
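The dataflow idea on this slide (services connected by data links, each running once its inputs exist) can be sketched in a few lines of Python. This is an illustration only, not Taverna's engine or workflow format: `run_dataflow`, the service names and the `links` layout are all hypothetical, and the sketch needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dataflow(services, links, workflow_inputs):
    """Run every service once all of its upstream values are available.

    services: name -> function (hypothetical stand-ins for real services)
    links:    name -> list of upstream names (workflow inputs or services)
    """
    values = dict(workflow_inputs)
    # static_order() yields each node after all of its upstream nodes.
    for name in TopologicalSorter(links).static_order():
        if name in values:  # a workflow input, already supplied
            continue
        args = [values[upstream] for upstream in links[name]]
        values[name] = services[name](*args)
    return values


# Output of "split" (service A) feeds both "count" and "upper" (B and C).
services = {
    "split": lambda text: text.split(","),
    "count": lambda items: len(items),
    "upper": lambda items: [item.upper() for item in items],
}
links = {"split": ["gene_ids"], "count": ["split"], "upper": ["split"]}

result = run_dataflow(services, links, {"gene_ids": "brca1,tp53"})
print(result["count"], result["upper"])
```

Only the links are declared; the execution order falls out of the data dependencies, which is the contrast with process coordination that the slide draws.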
  10. What types of services and data?
      • WSDL/SOAP web services
        – Secured invocation with HTTPS/SSL/WS-Security
      • RESTful web services
        – Secured invocation with HTTPS/Basic Auth
      • Spreadsheet import
      • Command line tools (local, SSH)
      • Inline scripts (Beanshell, R)
      • Excel/CSV spreadsheets
      • Java APIs
      • Customizations:
        – BioMart, BioMoby / SADI
        – Soaplab
        – Grid services (EGEE gLite, caGrid, PBS, UNICORE)
        – … your tool (Plugin tutorial in wiki)
  11. Service limitations
      • Web service creation involves wrapping existing tools or writing WS code
      • Web services can go down
        – Can use redundant services in workflow
        – Service monitoring
      • Transferring data up/down to WS slow
        – Support references in WS interface
      • Executing command line tools directly requires execution access
        – Trickier to share workflows, require either SSH/grid credentials or installing tools locally
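The "redundant services" point can be illustrated with a retry-then-failover wrapper. This is a minimal sketch under assumptions: the function `call_with_failover` and the two stand-in services are hypothetical, and Taverna configures retries and failover in the workflow rather than through code like this.

```python
def call_with_failover(services, payload, retries=2):
    """Try each equivalent service in turn, retrying before moving on."""
    last_error = None
    for service in services:
        for _attempt in range(retries):
            try:
                return service(payload)
            except ConnectionError as err:
                last_error = err  # remember why this attempt failed
    raise RuntimeError("all redundant services failed") from last_error


calls = []


def primary(sequence):
    # Hypothetical service that is currently down.
    calls.append("primary")
    raise ConnectionError("primary service is down")


def mirror(sequence):
    # Hypothetical mirror offering the same operation.
    calls.append("mirror")
    return sequence[::-1]


result = call_with_failover([primary, mirror], "ACGT")
print(result, calls)
```

The primary is tried `retries` times before the mirror is used, so a short outage costs a few extra calls rather than a failed run.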
  12. Which services?
      • Taverna is general, can connect to standard web services and command line tools for any domain
      • In bioinformatics:
        – From professional third-party organisations providing robust & open data/analysis services
        – To under-the-desk web services for one particular purpose, run by PhD students
      • http://biocatalogue.org/: 2000+ services from 140+ providers, crowd sourced and quality monitored
  14. BioCatalogue integration
      • Search services from workbench
      • Add services to workflow
      • View service descriptions and uptime status from within workflow
  15. Taverna workbench
      • Graphical desktop tool
      • No server installation required
      • Drag-and-drop services into diagram
      • Connect services, run, reconnect, rerun
      • Integrates diverse set of tools
  19. Sharing workflows
      • myExperiment.org allows users to share, find, download and rate workflows
      • “Facebook for the scientist”
      • 4000+ members, 1400+ workflows
      • Open source code, can set up own instance
  20. myExperiment integration
      • Search and browse workflows
        – By tags
        – Free text search
        – Own/group workflows
        – Packs, e.g. “Examples”
      • Upload/share workflows
  21. Taverna workflow features
      • Nested workflows
        – Reuse existing components
      • Implicit iterations
        – With customizable list handling
      • Pipelining
        – Process partial iteration results early
      • Parallelisation
        – Run as soon as data is available
      • Retries, failover, looping
        – For stability and conditional testing
      • Plugin-extensible execution control
        – Ideas: caching, error detection, dynamic service lookup
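Implicit iteration with customizable list handling can be pictured as follows: a service written for single values is applied automatically when it receives lists, and the list-handling strategy decides how several lists are combined. The sketch below is an assumption-laden illustration, not Taverna code: `implicit_iterate` and the strategy names `"cross"` and `"dot"` are chosen here for the example, and the cross product is returned as a flat list for brevity.

```python
from itertools import product, repeat


def implicit_iterate(service, *inputs, strategy="cross"):
    """Apply a single-value service to list inputs without an explicit loop."""
    if not any(isinstance(value, list) for value in inputs):
        return service(*inputs)  # plain call, nothing to iterate over
    if strategy == "cross":
        # Every combination of the list elements.
        combos = product(*[v if isinstance(v, list) else [v] for v in inputs])
    elif strategy == "dot":
        # Pair elements position by position; single values are repeated.
        combos = zip(*[v if isinstance(v, list) else repeat(v) for v in inputs])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [service(*combo) for combo in combos]


def join(gene, database):
    return f"{gene}:{database}"


crossed = implicit_iterate(join, ["x", "y"], ["1", "2"])
dotted = implicit_iterate(join, ["x", "y"], ["1", "2"], strategy="dot")
single = implicit_iterate(join, "x", "1")
print(crossed, dotted, single)
```

The service itself never changes; only the strategy passed alongside it does, which is what makes the list handling "customizable".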
  22. Extensible UI and engine
      • Plugins can provide new “perspectives”
        – e.g.: BioCatalogue, myExperiment
      • Provide service-specific customization
        – e.g.: BioMart interface replicates web site
      • Adding new functionality
        – New service types, eg: …
        – Execution control like looping/branching
        – Design helpers, “Find matching service”
  23. Workflow limitations
      • Initially designed for dataflows
        – Not suitable for business processes like “HR procedure for hiring new staff”
        – Long-running workflows require Taverna Server
      • But suitable for coordinating command line and grid executions; the data might just be job references
      • Execution control extensible, eg:
        – Looping, Branching
        – Dynamic service lookup
        – Data manipulation, Error detection
  24. Data and provenance handling
      • Data references passed between services in workflow
        – http, file, sftp, gridftp, etc (extensible)
      • Data downloaded/uploaded or references translated when needed
      • Provenance captured for workflow runs
        – Trace execution steps, view intermediate values while running
        – Export as Open Provenance Model (OPM) / RDF
        – Proof and origin of produced outputs
        – Extensible annotations
      • Wf4Ever: reproducible research objects
        – Workflow/data as a scientific publication, preservation
        – Need to capture more service data and metadata
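Capturing provenance for a run amounts to recording, for every step, which service ran, what it received and what it produced. A minimal sketch of that idea follows; the `traced` wrapper and the record fields are hypothetical and much simpler than Taverna's provenance store or an OPM/RDF export.

```python
import time


def traced(name, service, trace):
    """Wrap a service so that each invocation is appended to a trace list."""
    def wrapper(*args):
        started = time.time()
        output = service(*args)
        trace.append({
            "service": name,
            "inputs": args,      # intermediate values flowing in
            "output": output,    # intermediate value flowing out
            "started": started,
        })
        return output
    return wrapper


trace = []
double = traced("double", lambda x: 2 * x, trace)
increment = traced("increment", lambda x: x + 1, trace)

result = increment(double(5))
for record in trace:
    print(record["service"], record["inputs"], "->", record["output"])
```

Because each record links an output to the inputs that produced it, the trace can be walked backwards from a final result to its origin.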
  25. Data limitations
      • Running Workbench limited by:
        – Local disk space for storing data
        – Network speeds for up/download
        – Firewall access
        – Execute wf using Taverna Server or command line remotely with ssh/job submission
      • No standardized WS reference mechanism
        – Agree on mechanism within WS ‘family’ with shared disk (eg. deconstruct local path from HTTP URI)
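The shared-disk convention mentioned above (deriving a local path from an HTTP URI) might look like the sketch below. Everything specific here is an assumption made for illustration: the function `local_path_for`, the host `data.example.org` and the mount point `/mnt/shared` are hypothetical, and a real service family would have to agree on its own mapping.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def local_path_for(uri, base_uri, shared_root):
    """Map a data URI onto the shared disk, or return None if it is foreign."""
    parsed, base = urlparse(uri), urlparse(base_uri)
    if (parsed.scheme, parsed.netloc) != (base.scheme, base.netloc):
        return None  # served by another host: must be downloaded instead
    path = PurePosixPath(unquote(parsed.path))
    try:
        relative = path.relative_to(PurePosixPath(unquote(base.path)))
    except ValueError:
        return None  # outside the agreed URI prefix
    if ".." in relative.parts:
        return None  # refuse paths that climb out of the shared root
    return PurePosixPath(shared_root) / relative


base = "http://data.example.org/files/"
print(local_path_for("http://data.example.org/files/run42/out.csv", base, "/mnt/shared"))
print(local_path_for("http://other.example.org/files/run42/out.csv", base, "/mnt/shared"))
```

Returning `None` for anything outside the agreed prefix keeps the fallback explicit: such references still have to be transferred over the network.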
  26. Parameter sweeps
      • Implicit iterations with pipelining provide an intuitive way to set up parameter sweeps
      • Advanced looping and extensible execution control allow iterative & recursive reductions/approximations
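A parameter sweep is the cross product of the parameter lists, with each combination run independently. The sketch below shows that shape under stated assumptions: `sweep`, the toy `model` and the parameter names are hypothetical, and a thread pool stands in for the workflow engine's parallel execution.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def sweep(model, **parameter_lists):
    """Run model once for every combination of the given parameter values."""
    names = list(parameter_lists)
    combos = [dict(zip(names, values))
              for values in product(*parameter_lists.values())]
    with ThreadPoolExecutor() as pool:
        # pool.map keeps results in the same order as the combinations.
        results = list(pool.map(lambda params: model(**params), combos))
    return list(zip(combos, results))


def model(temperature, density):
    # Hypothetical stand-in for a simulation code or remote service.
    return temperature * density


runs = sweep(model, temperature=[1, 2], density=[10, 20])
for params, value in runs:
    print(params, "->", value)
```

Adding a third parameter list extends the sweep without touching `model`, which mirrors how list inputs drive iteration in a workflow.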
  27. Taverna command line
      • Executes from Windows/Linux/OSX shells
      • Takes a predefined workflow with files as inputs and outputs
      • Quick way to “productionize” a workflow
  28. Taverna Server
      • REST/SOAP interface to execute workflows
      • Client libraries for Ruby and Java
      • Two demonstration web interfaces
        – Ruby
        – Java Portlets
      • Upcoming:
        – Security delegation
        – AWS image
  29. Taverna portlet
      • Example portlet interface
      • Executes workflows using Taverna Server
  31. Ruby web interface
      • Example customized web interface
      • Uses Ruby gem t2-server
  32. Grids and clusters
      • Taverna has been integrated with several leading grid and middleware infrastructures, such as:
        – PBS
        – caGrid/Globus
        – EGEE/gLite
        – NorduGrid’s ARC
        – JSDL/GridSAM
      • Plans for SAGA integration
  33. Taverna on the cloud
      • Use-case:
        – SNP analysis and annotation of genomes sequenced from breeds of cows in Africa: why are some of them resistant to X?
        – Amazon EC2 with Taverna Server and local services
        – Ruby on Rails web interface
        – Runs through 31 chromosomes in 2 hours using 10 instances, for $10
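The figures quoted for the cloud run imply a rough unit cost. The arithmetic below is only a back-of-envelope sketch: it assumes that all 10 instances ran for the full 2 hours and that the $10 covers compute only, neither of which the slide states.

```python
# Figures quoted on the slide.
chromosomes = 31
wall_clock_hours = 2
instances = 10
total_cost_usd = 10.0

# Derived values (assumption: every instance ran for the full two hours).
instance_hours = instances * wall_clock_hours
cost_per_instance_hour = total_cost_usd / instance_hours
cost_per_chromosome = total_cost_usd / chromosomes

print(f"{instance_hours} instance-hours")
print(f"${cost_per_instance_hour:.2f} per instance-hour")
print(f"${cost_per_chromosome:.2f} per chromosome")
```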
  35. Taverna 3 roadmap
      • OSGi plugin system
      • Workflow language: Scufl2
        – Compound format; embedding metadata, dependencies, independent API for creating/inspecting workflows
      • Components
        – Finding/sharing command line tool descriptions
        – Richer way of finding compatible services
  36. Open source, open development
      • The Taverna suite of tools is all open source, free to use and customize
      • Large user community, active mailing lists
      • Lead developers: myGrid in Manchester, UK
      • Contributors from across the world
      • PAL programme
      • myGrid provides training, tutorials and documentation
  37. Who uses Taverna?
      • Bioinformatics: EMBL-EBI, ONDEX
      • Astronomy: HELIO, AstroGrid, SAMPO
      • Engineering: NASA Jet Propulsion Lab (JPL)
      • Chemistry: CDK, CIC
      • Biodiversity: BioVel
      • Preservation: Wf4Ever, SCAPE
      • BioMedicine/Cancer research: caGrid
      • Data/text mining: eLico, AID
  38. Taverna in numbers
      • myExperiment: 4000+ registered users, 1400+ workflows
      • Taverna: 70,000+ downloads
      • BioCatalogue: 2000+ services, 150+ service providers
      [Column layout lost in extraction; figures not attributable to a heading: 56 countries, 361 organisations, 48 countries, ~4000 source, 500+ members, 27 countries]
  39. Acknowledgements
  41. More information
      • http://www.mygrid.org.uk/
      • http://www.taverna.org.uk/
      • http://www.myexperiment.org/
      • http://www.biocatalogue.org/
