Invited talk @ ESIP summer meeting, 2009
ESIP summer meeting, July 7-10 2009, Santa Barbara, CA

  • Repetitive and mundane boring stuff made easier, reliable and adaptable. Big science and collaborative science
  • Interoperability, Integration and Collaboration Automated processing Interactive Repetitive and accurate compound processes (protocols) Transparent processes Data flow Trackable results Agile software development
  • This workflow searches for genes which reside in a QTL (Quantitative Trait Loci) region in the mouse, Mus musculus. The workflow requires an input of: a chromosome name or number; a QTL start base pair position; a QTL end base pair position. Data is then extracted from BioMart to annotate each of the genes found in this region. The Entrez and UniProt identifiers are then sent to KEGG to obtain KEGG gene identifiers. The KEGG gene identifiers are then used to search for pathways in the KEGG pathway database. This is pathways_and_gene_annotations_for_qtl_phenotype_28303, executed with chromosome = 17, start_position = 28500000, end_position = 32500000.
  • mention scalability: we support large datasets as well, in addition to these small lists, and lists can grow very large
  • The case study of Paul’s work Different datatypes, held in geographically different places of different subdisciplines. Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping Genes captured in microarray experiment and present in QTL (Quantitative Trait Loci ) region Microarray + QTL
  • The workflow produces OMIM tagged diseases which can be used to enrich the proto-ontology automatically in RDF
  • - caGrid (or the Grid) is the underlying network architecture and platform that provides the basis for connectivity of caBIG® tools. - ARC (Advanced Resource Connector) plugin for Taverna used for medical imaging. Taverna is a workflow management system well known in bioinformatics. To show the ease of transitioning from local to grid, we execute MUSTANG first on the command line, then on the grid using Taverna. We then present two examples from the field of medical imaging, where the first one has to deal with huge temporary datasets. It thus greatly benefits from ARC's storage management and grid URL handling capabilities. The last example shows how one can achieve rapid testing iterations by separating the program binary from its use case description and the workflow. In the end, myExperiment is presented, which is a free web community for sharing Taverna workflows. By preparing a use case and an example workflow one can make one's own program easily usable by everyone, since the ARC middleware equalizes different grid configurations. video
  • aimed at different layers of the software stack “ The Many Faces of IT as Service”, Foster, Tuecke, 2005 “ Provisioning” – reservation to configuration to … … make sure resource will do what I want it to do, with the right qualities of service Virtualization = separation of concerns between provider & consumer of “content” Client and service Service provider and resource provider Provisioning = assemble & configure resources to meet user needs Management = sustain desired qualities of service despite dynamic environment
  • The GT4 plug-in supports semantic-based service query. Users can enter multiple service query criteria (up to three; in the current scenario three is enough, and we can add more in our program upon request) and the corresponding values. Multiple criteria can be combined. The initial GUI only shows one query criterion, but more can be added by clicking the “Add Service Query” button. For example, we can query the caGrid services whose “Research Center” name is “Ohio State University”, with Service Name “DICOMDataService”, and which have the operation “PullOp”.
  • we are now taking a broader view... what we have shown so far is just the “run” part of a more comprehensive lifecycle
  • Beanshell scripting and XML processing support inside the workflows Taverna 2: long running workflows, data reference handling, data streaming and staging, multiple extensibility points. Complete the Taverna 2 properties New data reference handling, security management, provenance management, asynchronous processor and data streaming, explicit monitoring and steering support, new dispatch layer better, supports dynamic service binding and service invocation through a resource broker, improved concurrency handling at the workflow level
  • The clustalw program from Emboss is called ‘emma’ Services are not deposited and preserved in software libraries. Rapid metadata heart-beat, especially on operational metadata. BioNanny – using Grid tools Use myExperiment to notify scientists with potential problems Use myExperiment to be smart about which services should be monitored. Workflows are deposited but…. Not self-contained. Linking to external services in flux. Or depend on software Incorporating services unavailable to others. Workflow fragility and hence decay. Workflows become plans and provenance rather than working scientific objects unless tended and updated.
  • for DEMO: search adaptivedisclosure key point: large-scale curation of rich service descriptions through community engagement, sustained over time, to ensure the quality of annotations what makes biocat stand apart from other approaches to service repositories? biocatalogue is a "super-registry" that is able to accommodate service descriptions from multiple different source registries thanks to a flexible annotation model context of application: distributed, P2P biocatalogue
  • SeekDA is DERI. search engine for WS http://www.ebi.ac.uk/uniprot-das/
  • curation results in trust and therefore usage encourages / justifies sustained further curation efforts really happens in all the associated tools that have references and use biocatalogue e.g. curation of Taverna services, of myExp workflows, etc accreditation of curators: idea borrowed from the wikipedia style of community contribution. not all contributors are equal, but all are entitled to providing contributions
  • How are the types of annotations chosen? Contributors are free to add pretty much any type of annotation at the moment, using a simple tagging mechanism, i.e., annotations are not necessarily locked into controlled vocabularies / ontologies. Think of them as name-value pairs for the time being, but already exposing metadata as RDF through the API. Users can rate services; the main rating criteria are expected to be ease of use, availability and quality of documentation. Separate subjective from objective. Additionally, there is room for automated curation: using Quasar, for example. Service monitoring and associated automated annotations. Functional testing of services. Let users / providers upload complex test scripts, including full-fledged workflows, for example; biocatalogue can run these periodically and annotate the services with the outcome (think "junit test reports" as automated annotations). Service Profile Wheel: Availability, Freshness
  • Automated monitoring & testing Test scripts, endpoint availability, meantime failure Partner feeds myExperiment.org Workflow profile Update feeds to users Develop incentives Expert for oversight How do we rank? How do we compare non-alike?
  • Cite FLOSS
  • myExperiment is as much an engineering project as it is a social experiment
  • e-Science is me-Science. Aligning community with individual. But we have to be aware of the drivers for collaboration. Competitive advantage: be the first with the Nature paper. Academic vanity: credit, credibility, fame, acclaim, recognition, peer respect, reputation. Adoption: get my stuff adopted / recognised. More funding. Being found out: open to rigorous inspection. Being scooped: beaten by lab X. Protecting my turf. Releasing results too early. Getting left behind. Being out of fashion. Looking stupid. Being misinterpreted or misrepresented. Losing control. Taking a risk.
  • attribution -> provenance
  • Transferring data, methods and know-how from one discipline to another (e.g. astronomy image analysis applied to cancer tissue microarrays) How do you find relevant material that uses a different jargon in a different discipline organised to only suit its experts? Validation? How do I know it does what it says it does? Reproducibility? When the services are volatile. Reusability? When it contains an in-house code or application. Longevity? Will this workflow still run in 6 months time? Palpability? Why does this workflow fail? why does it work? How does it work?
  • Validation? How do I know it does what it says it does? Reproducibility? When the services are volatile. Reusability? When it contains an in-house code or application. Longevity? Will this workflow still run in 6 months time? Palpability? Why does this workflow fail? why does it work? How does it work?
  • attribution -> provenance; curation -> provenance
  • make ref to second talk -- on ROs
  • myExperiment is as much an engineering project as it is a social experiment
  • step back: what do we mean by provenance, in general? data and process provenance. what do we do with it? we collect a great deal of metadata, at a very fine level of granularity: what can we expect to get out of it?
  • see dedicated in-depth talk for details on our approach
  • Lineage query lets users identify variables that carry interesting values for which provenance is sought nodes in the graph where provenance information should be reported

Transcript

  • 1.
    • Dr. Paolo Missier , Prof. Carole Goble
    • Information Management Group
    • School of Computer Science, University of Manchester, UK
    • with additional material kindly provided by:
    • Prof. Dave DeRoure, Prof. Luc Moreau, Univ. of Southampton, UK
    • Andrea Wiggins, Syracuse University, NY
    Taverna, Biocatalogue, myExperiment, and the provenance of it all: forward-looking while looking back Scientific Workflow Management System
  • 2. Outline
    • Part I: models and technology for e-science
    • Addressing the needs of the e-scientist:
      • Workflow as a model of experimental science
        • Taverna
        • Services as building blocks
        • Biocatalogue
    • Scaling up along the social dimension:
        • towards open, collaborative science
        • myExperiment
    • Part II: Explaining and Preserving experimental outcomes
    • Data provenance support in Taverna
    • provenance for open science: the OPM vision
    ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier
  • 3. What is the myGrid Project?
    • UK e-Science pilot project since 2001.
    • Centred at Manchester, Southampton and the EMBL-EBI
    • Part of Open Middleware Infrastructure Institute UK http://www.omii.ac.uk .
    • Mixture of developers, bioinformaticians and researchers
    • An alliance of contributing projects and partners
    • Open source development and content LGPL or BSD
    • Infrastructure
    • We don’t own any resources (apart from catalogues)
    • Or a Grid.
  • 4.
    • Data preparation and analysis pipelines.
    • Data preparation pipelines
    • Data integration pipelines
    • Data analysis pipelines
    • Data annotation pipelines
    • Warehouse population refreshing
    • Data and text mining
    • Knowledge extraction.
    • Parameter sweeps over simulations/computations
    • Model building and verification
    • Knowledge management and model population
    • Hypothesis generation and modelling
    E. Science laboris
  • 5. Workflows: E. Science laboris
    • Pipeline processing
    • Automated processing
    • Repetitive and mundane boring stuff made easier, reliable and adaptable.
    • Shield interoperability horror
    • Trackable results
    • Agile software development
    • Big science, small science & collaborative science
  • 6.
    • Workflow Management System is machinery for coordinating the execution of (scientific) services and linking together (scientific) resources.
    • Handles cross cutting concerns like error handling, service invocation, data movement, data streaming, provenance tracking…..
    • A workflow is a specification configured for each run.
    Workflows: E. Science laboris
  • 7. Separate the “workflow” from the application
    • Web Services
      • Programmatic Interfaces for software rather than people, (described by a Web Service Description Language (WSDL) Document).
    • Workflows
      • Automated and coordinated sequences of tasks. Automated bioinformatics protocols.
    Retrieves a protein sequence in FASTA format from GenBank and then performs a BLAST on that sequence
  • 8. Workflow as data integrator: QTL genomic regions -> genes in QTL -> metabolic pathways (KEGG)
  • 9. Data-driven computation in Taverna
    • List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ]
    • [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]
    • [ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ...] ]
    • Processor ports: geneIDs, pathways
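The list-structured values on this slide suggest how a service written for a single value can be applied to a whole nested list. The following is a minimal Python sketch of that idea, not Taverna's actual implementation: `map_nested`, `get_pathways_by_gene` and the `PATHWAYS` table are hypothetical names, and the gene-to-pathway assignments are illustrative only.

```python
# Illustrative lookup table standing in for a remote KEGG service (assumption).
PATHWAYS = {
    "mmu:26416": ["path:mmu04010", "path:mmu04370"],
    "mmu:328788": ["path:mmu04210", "path:mmu04010"],
}


def get_pathways_by_gene(gene_id):
    """Hypothetical single-value service: one gene id in, a list of pathways out."""
    return PATHWAYS.get(gene_id, [])


def map_nested(service, value):
    """Apply `service` to every leaf of a nested list, preserving the nesting."""
    if isinstance(value, list):
        return [map_nested(service, item) for item in value]
    return service(value)


gene_ids = [["mmu:26416"], ["mmu:328788"]]
pathways = map_nested(get_pathways_by_gene, gene_ids)
print(pathways)
```

Because the service itself returns a list, the result is nested one level deeper than the input, which is why the pathway values on the slide appear as lists of lists.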
  • 10. Integration platform Just in Time and Just in Case Interoperability
    • Shield users and applications from complexity.
    • Handle plumbing.
  • 11. Integration platform Just in Time Interoperability
    • Shim services and scripts
    • Mapping and format transformations
    • Beanshell scripting inside
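A shim is typically a very small format transformation placed between two services whose outputs and inputs do not quite match. As a hedged illustration (written in Python rather than Beanshell, with the hypothetical name `fasta_to_sequence`), a shim that strips the header from a FASTA record so that the bare sequence can be passed on might look like this:

```python
def fasta_to_sequence(fasta_text):
    """Drop FASTA header lines (starting with '>') and join the sequence lines."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


record = ">example_protein illustrative header\nMKTAYIAK\nQRQISFVK\n"
print(fasta_to_sequence(record))
```

The point of the sketch is the size of the step: the shim carries no science of its own, it only makes two otherwise incompatible services fit together.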
  • 12. What do Scientists use Taverna for?
    • Systems biology model building; Proteomics; Sequence analysis; Protein structure prediction; Gene/protein annotation; Microarray data analysis; QTL studies; QSAR studies; Medical image analysis; Public Health care epidemiology; Heart model simulations; High throughput screening; Phenotypical studies; Phylogeny; Statistical analysis; Text mining; Astronomy, Music, Meteorology
    • Netherlands Bioinformatics Centre; Genome Canada Bioinformatics Platform; BioMOBY; US FLOSS social science program; RENCI; SysMO Consortium; French SIGENAE farm animals project; ThaiGrid; CARMEN Neuroscience project; SPINE consortium; EU Enfin, EMBRACE, BioSapian, Casimir; EU SysMO Consortium; NERC Centre for Ecology and Hydrology; Bergen Centre for Computational Biology; Max-Planck institute for Plant Breeding Research; Genoa Cancer Research Centre; AstroGrid; 30 USA academic and research institutions
  • 13. 200 Genotype Phenotype Metabolic pathways Literature [Paul Fisher]
  • 14. Hypothesis Construction and Explanation from the Literature my BioAID
    • Marco Roos, Scott Marshall,
    • University of Amsterdam
    • Combines text mining with ontology modelling
    • AIDA: text mining and machine learning toolbox
    • Part of the Virtual Laboratory project VLe
  • 15. Genome-wide SNP data sets analysis
    • Analysis over compute clusters
    • Automate annotation of results
    • Mine annotation data for patterns
  • 16. WaaS: Workflows as a Service [Pettifer, Kell, University of Manchester] inside
  • 17. Workflows operating over Grid Infrastructure
    • http://www.knowarc.eu : "KnowARC integrated with Taverna" application prototype to use Taverna as direct interface to Grid resources running ARC.
    • http://cagrid.org/ : Open source grid software infrastructure aimed at enabling multi-institutional data sharing and analysis. Underpins caBIG. Taverna links together caGrid resources.
    • http://www.eu-egee.org/ : Europe’s leading grid computing project. Piloted Taverna over EGEE gLite services.
  • 18. [Diagram labels: Users, Workflows, Appln Service, Composition, Incorporation, Invocation, Provisioning]
    • Service-oriented applications
    • Applications components of workflows
    • Compose applications into workflows
    • Incorporate workflows into applications
    • Service-oriented Grid Infrastructure
    • Provision physical resources to support application workflows
    • Coordinate resources through workflows
    [Foster 2005] Workflows
  • 19. caBIG cancer cyberinfrastructure uses Taverna to link services
    • A sample caGrid workflow for microarray analysis, using caArray, GenePattern and geWorkbench [Ravi Madduri]
    • Orchestrating caGrid Services in Taverna, Wei Tan, Ravi Madduri, Kiran Keshav, Baris E. Suzek, Scott Oster, Ian Foster, Proc IEEE Intl Conf on Web Services (ICWS 2008)
  • 20. Who else is in this space? Kepler, Triana, BPEL, Ptolemy II, Taverna, Trident, BioExtract
  • 21. An Ecosystem of Workflow Management Systems (WFMS)
    • Askalon
    • Bigbross Bossa
    • Bea's WLI
    • BioPipe
    • BizTalk
    • BPWS4J
    • Breeze
    • Carnot
    • Con:cern
    • DAGMan
    • DiscoveryNet
    • Dralasoft
    • Enhydra Shark
    • Filenet
    • Fujitsu's i-Flow
    • GridAnt
    • Grid Job Handler
    • GRMS (GridLab Resource Management System)
    • ObjectWeb Bonita
    • OFBiz
    • OMII-BPEL
    • Open Business Engine
    • Oracle's integration platform
    • OSWorkflow
    • OpenWFE
    • Q-Link
    • Pegasus
    • Pipeline Pilot
    • Platform Process Manager
    • P-GRADE
    • PowerFolder
    • PtolemyII
    • Savvion
    • Seebeyond
    • Sonic's orchestration server
    • Staffware
    • GWFE
    • GWES
    • IBM's holosofx tool
    • IT Innovation Enactment Engine
    • ICENI
    • Inforsense
    • Intalio
    • jBpm
    • JIGSA
    • JOpera
    • Kepler
    • Karajan
    • Lombardi
    • Microsoft WWF
    • NetWeaver
    • Oakgrove's reactor
    • ScyFLOW
    • SDSC Matrix
    • SHOP2
    • Taverna
    • Triana
    • Twister
    • Trident
    • Ultimus
    • Versata
    • WebMethod's process modeling
    • wftk
    • XFlow
    • YAWL Engine
    • WebAndFlo
    • Wildfire
    • Werkflow
    • wfmOpen
    • WFEE
    • ZBuilder ……
  • 22. Workflow-based experimentation lifecycle: Run, Analyse Results, Publish, Develop; collect and query provenance metadata
  • 23. Taverna
    • Graphical Workbench For Professionals
    • Plug-in architecture
    • Nested Workflows
    • Drag and Drop
    • Wiring together
    • Rapidly incorporate new services without coding.
    • Not restricted to predetermined services
    • Access to local and remote resources and analysis tools
    • 3500+ service operations available at start-up
  • 24. Services Mutability: implications for sustainability, accountability and reproducibility
    • Reliability and robustness of workflows depends on the reliability and robustness of the components
    • In house service support
    • Services in constant and (silent) change.
    • Versioning.
    • Workflow Decay
    • Monitoring and Repair of wrappers, shims and service substitutions.
  • 25. BioCatalogue
    • http://www.biocatalogue.org
    • http://beta.biocatalogue.org
    28 April 2009, Boston MA. Professor Carole Goble, University of Manchester, UK, Director myGrid Consortium
  • 26.
    • Public, Curated Catalogue of Life Science Web Services
    • Register, Find, Curate Web Services
    • Community-sourced annotation, expert oversight
    • Open content
    • Open platform with open REST interfaces
    • Web 2.0 site and development.
    • Open source code base.
    • Started June 2008. In first beta phase.
    • Launched June 2009 at ISMB.
    • www.biocatalogue.org
    The short story
  • 27.  
  • 28. Content
    • Community contributed
      • Service providers
      • Third Parties
    • Automated crawling
    • Sourced from partners and registries
    • Chiefly public services
    • "import and add value"
  • 29. A virtuous circle
    • curation involves substantial human effort
      • why would it happen at all?
    curation usage trust annotators content
  • 30. Curation Model Versioning Quantitative Content Tags Service Model Semantic Content Ontologies Functional Capabilities Provenance Operational Capabilities Operational Metrics Usage Policy Community Standing Ratings Usage Statistics Attribution Free text Searching Statistics Usable and Useful Understandable Controlled vocabs Interfaces
  • 31. Curation
    • Just enough just in time
    • Universal annotation scheme
    • Mixed: Free text, Tags, controlled vocabs, community ontologies
    • Community sourced tags, comments, recommendations
    • Expert curation ontology-based annotation. myGrid OWL Ontology
    • Automated WSDL ripping and analytics
    • Automated monitoring & testing
    • Partner feeds (e.g. myExperiment)
    • Update feeds to users
    Today: 14902 annotations (provider, user, registries) KEGG: 1433 annotations
  • 32. Service monitoring
  • 33. Who?
    • Three years of guaranteed funding (June 2008 - May 2011)
    • Sustainability guaranteed by EMBL-EBI
  • 34. Scaling up along the social dimension
    • What? Where? Why? Who? How?
    • Crossing the boundaries of individual investigation
    • Develop, Run, Analyze, Publish (shown twice)
  • 35. Scientific collaboration. Source: Andrea Wiggins, talk given at the School of Computer Science, University of Manchester, UK, June 18th, 2009
  • 36. Source: Andrea Wiggins, talk given at the School of Computer Science, University of Manchester, UK, June 18th, 2009
  • 37. Source: Andrea Wiggins, talk given at the School of Computer Science, University of Manchester, UK, June 18th, 2009
  • 38. The Selfish (or Self-interested) Scientist
    • “A biologist would rather share their toothbrush than their gene name”
    Mike Ashburner and others, Professor of Genetics, University of Cambridge, UK
    “Data mining: my data’s mine and your data’s mine”
  • 39. The potential for collaboration
    • What :
      • processes: “materials and methods” -> workflows
      • data: unlikely, and certainly not until published
      • metadata (annotations, provenance traces...): ??
    • When :
      • for contributors: part of publication process
        • some publishers demand public data and repeatable experiments
      • for consumers: reuse as part of experiment design
    • Where and how :
      • a meeting point for a virtual community
      • Web 2.0 style of interaction
      • voluntary, incentive-based contributions
  • 40. Asymmetric vs symmetric sharing
    • Traditional sharing is asymmetric:
      • Producer-consumer: from service providers to workflow designers
      • Biocatalogue
    • Open science is symmetric:
      • Peer-based: sharing of workflows as complex processes
      • myExperiment
    • myGrid combines both paradigms:
      • Service space “closed under composition”: workflows are compositions of services, and they are services themselves
      • Scientists become providers of conceptual process models, and of executable services as well!
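The phrase “closed under composition” can be illustrated with plain function composition: if a service is anything that maps an input to an output, then a workflow built from services has the same shape and can itself be used as a service inside a larger workflow. This is a minimal Python sketch under that assumption; `compose` and the toy services are hypothetical and unrelated to the Taverna API.

```python
def compose(*services):
    """Chain services left to right; the result is itself a service."""
    def workflow(value):
        for service in services:
            value = service(value)
        return value
    return workflow


# Two toy services and a workflow built from them.
clean = compose(str.strip, str.upper)

# The workflow `clean` is reused as a single step of a larger workflow.
pipeline = compose(clean, lambda text: text + "!")

print(pipeline("  cat "))
```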
  • 41. Publishing for collaboration
    • Lifecycle: Run, Analyse Results, Publish, Develop; collect and query provenance metadata
    • Design-time reuse: composition from existing workflows
    • Runtime reuse: workflows as services
    • compare results across versions
    • foster virtual scientific communities
    • provenance exchange and interoperability: the OPM experiment
  • 42. Collaboration in the workflow space
    • What? Where? Why? Who? How?
    • Develop, Run, Analyze, Publish (shown twice)
  • 43.
    • Collaborate and share with my colleagues and friends I trust.
    • And people I don’t and may never know. And rivals?
    • Sharing happens….
      • Compelled.
      • Culture.
      • In best interest.
      • Good citizenship.
      • Incentive & Rewards.
  • 44.
    • Rewards: Competitive advantage. Academic vanity. Adoption. Reputation.
    • Fears: Scrutiny. Being scooped. Misinterpretation. Reputation.
  • 45.
    • Getting author to take credit!
    • Creating a culture of attribution.
    • Attribution and credit chains.
    • Licensing and rights protection
  • 46. Incentive and reputation
    • Strong sense of persistent identity.
    • Building reputation and boasting opportunities.
    • Cult of the individual.
    • High visibility to the participant and the community.
    • Downloads & Views.
    • Instrumentation and automated analysis.
    • Feedback.
    • Liability policy.
  • 47. Distinct myExperiment communities
    • Supermarket shoppers: workflow consumers prefer larger workflows ready to be downloaded and enacted
    • Tool builders: workflow authors prefer smaller, modularized workflows which can be assembled & customized
    • Leaders and followers
    • For contributions
    +
  • 48. Reuse, Recycling, Repurposing Cross-fertilization
    • Paul writes workflows for identifying biological pathways implicated in resistance to Trypanosomiasis in cattle
    • Paul meets Jo. Jo is investigating Whipworm in mouse.
    • Jo reuses one of Paul’s workflows without change.
    • Jo identifies the biological pathways involved in sex dependence in the mouse model, believed to be involved in the ability of mice to expel the parasite.
    • Previously a manual two year study by Jo had failed to do this.
  • 49.
    • Which workflow do other people I respect recommend for microarray normalisation?
    • What does this workflow do?
    • Can I run this foreign workflow?
    • How do I run and steer this workflow?
    • Can I substitute my service/dataset?
    • Did someone else run this and have a problem? Or get a different result?
    • What other workflows were used with this workflow?
    • Is this workflow reliable? Does it still work? How can I make it work when one of the services doesn’t run anymore?
    • Can I interpret the provenance of this data item properly?
    ?
  • 50.
    • Sunday 10th May: 1748 registered users, 143 groups, 669 workflows, 197 files, 52 packs, 56 different countries. Top 4: UK, US, The Netherlands, Germany
    Socially share, discover and reuse workflows and other methods. Cooperative bazaar. www.myexperiment.org
  • 51.  
  • 52. Just Enough Sharing Credit and Attribution
  • 53. Authoring workflows for reuse
    • Workflow decay as services decay
      • Service and Workflow analytics
    • Anonymous reuse
      • Reluctance to relinquish control
      • Poor metadata, poor palpability
      • Local services / licensing lock-in
    • Incentives for sharing
      • Reputation, Rights, Protection
    • Author guarantees
      • Credit and attribution; Accurate reuse
    • Consumer guarantees
      • Verification, validation and safety
    • Best practice
  • 54. Authoring workflows for reuse
    • Transferring data, methods and know-how from one discipline to another (e.g. astronomy image analysis applied to cancer tissue microarrays)
    • How do you find relevant material that uses a different jargon in a different discipline organised to only suit its experts?
  • 55. Workflows and Services Curation by Experts Social Curation by the Crowd refine validate refine validate Self-Curation by Contributors seed seed refine validate seed refine validate seed Automated Curation
  • 56. Packs
  • 57.  
  • 58. Three chief user groups
    • Content contributors
    • Curators and editors
    • Consumers
    Contributors Members Anon. Users
  • 59. New Instances
  • 60. Collaboration in the workflow space
    • What? Where? Why? Who? How?
    • Develop, Run, Analyze, Publish (shown twice)
  • 61. Technical implications of open science
    • Process interoperability
      • SOA principles: runtime interoperability
      • but, still no common workflow model after all!
    • Data interoperability
      • Traditional heterogeneity / integration issues
      • Dataspaces
      • LinkedData
      • ...
    • Aggregation: creating logical units
      • process + inputs + outputs + provenance traces + ...
      • Research Objects
    • Provenance interoperability
  • 62. Provenance of data
    • “It would include details of the processes that produced electronic data as far back as the beginning of time or at least the epoch of provenance awareness.”
    Luc Moreau, Paul Groth, Simon Miles, Javier Vazquez-Salceda, John Ibbotson, Sheng Jiang, Steve Munroe, Omer Rana, Andreas Schreiber, Victor Tan, Laszlo Varga, The provenance of electronic data, Communications of the ACM, Vol. 51 No. 4, Pages 52-58
  • 63. Analysis of process results
    • Causal relations:
      • which pathway sets come from which gene sets?
      • which processes contributed to producing this image?
      • which process(es) caused this data to be incorrect?
      • which data caused this process to fail?
    • Process and data analytics:
      • show me the variations in output in relation to an input parameter sweep (multiple process runs)
      • how often has my favourite service been executed?
        • on what inputs?
      • who produced this data?
      • how often does this pathway turn up when the input genes range over a certain set S?
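Several of the analytics questions above reduce to simple aggregation once execution events are available as records. The sketch below assumes a hypothetical flat trace format (one dictionary per service invocation with `run`, `service` and `input` keys); it is not the provenance schema used by Taverna, and the records are made up for illustration.

```python
from collections import Counter

# Hypothetical trace: one record per service invocation (illustrative data).
trace = [
    {"run": 1, "service": "get_pathways_by_genes", "input": "mmu:26416"},
    {"run": 1, "service": "get_pathways_by_genes", "input": "mmu:328788"},
    {"run": 2, "service": "get_pathways_by_genes", "input": "mmu:26416"},
    {"run": 2, "service": "blast", "input": "seq-1"},
]


def invocation_counts(records):
    """How often has each service been executed?"""
    return Counter(record["service"] for record in records)


def inputs_seen(records, service):
    """On what distinct inputs has a given service been executed?"""
    return sorted({r["input"] for r in records if r["service"] == service})


counts = invocation_counts(trace)
print(counts["get_pathways_by_genes"])
print(inputs_seen(trace, "get_pathways_by_genes"))
```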
  • 64. Example: inverse data associations
    • List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ]
    • [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]
    • [ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ...] ]
    • Processor ports: geneIDs, pathways
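An inverse data association answers the question "which genes led to this pathway?" from a record of "which pathways came from this gene?". A minimal Python sketch follows, assuming the forward associations are already available as a dictionary; `invert_associations` is a hypothetical helper and the gene-to-pathway assignments are illustrative, not taken from KEGG.

```python
from collections import defaultdict


def invert_associations(gene_to_pathways):
    """Turn gene -> [pathways] into pathway -> [genes], keeping insertion order."""
    pathway_to_genes = defaultdict(list)
    for gene, pathways in gene_to_pathways.items():
        for pathway in pathways:
            pathway_to_genes[pathway].append(gene)
    return dict(pathway_to_genes)


# Illustrative forward associations (assumption).
gene_to_pathways = {
    "mmu:26416": ["path:mmu04010", "path:mmu04370"],
    "mmu:328788": ["path:mmu04210", "path:mmu04010"],
}

inverse = invert_associations(gene_to_pathways)
print(inverse["path:mmu04010"])
```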
  • 65. Taverna + provenance
    • Taverna type system: strings + nested lists
      • “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ]
    • Taverna dataflow model: data-driven execution
        • services activate when input is ready
    • Workflow provenance: a detailed trace of workflow execution
      • which services were executed
      • when
      • inputs used, outputs produced
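The three points above can be illustrated with a toy model. This is a minimal sketch, not Taverna's API: `depth`, `run_dataflow` and the two toy services are hypothetical names introduced for the example.

```python
from datetime import datetime, timezone

def depth(value):
    """Nesting depth of a value: 0 for a string, 1 for a list of strings, and so on."""
    if isinstance(value, str):
        return 0
    return 1 + max((depth(v) for v in value), default=0)

def run_dataflow(services, initial_data):
    """Data-driven execution: a service fires as soon as all of its inputs are available.

    services maps a service name to (input_names, output_name, function).
    Returns the final data and a trace of what ran, when, and on which values.
    """
    data = dict(initial_data)
    trace = []
    pending = dict(services)
    progressed = True
    while pending and progressed:
        progressed = False
        for name, (inputs, output, fn) in list(pending.items()):
            if all(i in data for i in inputs):
                args = [data[i] for i in inputs]
                data[output] = fn(*args)
                trace.append({
                    "service": name,
                    "when": datetime.now(timezone.utc),
                    "inputs": dict(zip(inputs, args)),
                    "outputs": {output: data[output]},
                })
                del pending[name]
                progressed = True
    return data, trace

# Declared out of order on purpose: readiness of inputs decides the execution order.
services = {
    "join": (["upper_words"], "joined", lambda ws: ",".join(ws)),
    "upper": (["words"], "upper_words", lambda ws: [w.upper() for w in ws]),
}
data, trace = run_dataflow(services, {"words": ["cat", "dog"]})
print([event["service"] for event in trace], data["joined"])
```

The trace entries hold exactly the items listed on the slide: the service executed, the time, and the inputs used and outputs produced.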
  • 66. Forms of provenance ...
    • Focus is on the data: the observable outcomes of a process
    • raw provenance metadata / provenance metadata + interpretation framework
    • design
      • process structure (workflow graph)
      • history of process composition - reuse
      • process versions
      • service annotations:
        • ex. get_pathways_by_genes
        • who created / edited: attribution
        • why: purpose, intent
    • execution
      • process events:
        • service invocation
        • data production / consumption
      • causal dependency graphs
        • ex.: list_of_geneIDList = [ a, b, c ], paths_per_gene = [ [d,e,f], [g,h,j] ] ... in run #32
      • data annotations
      • results interpretation in terms of conceptual data model:
        • set of pathways -> gene sets
  • 67. ...and their uses and associated challenges
    • raw provenance metadata / provenance metadata + interpretation framework
    • fully implemented in Taverna 2
    • design
      • exploiting semantic properties of the process structure to improve provenance exploitation
      • exploring process space across versions and structural similarities
        • graph matching
        • semantic-based search of process space
    • execution
      • enabling partial re-runs of resource-intensive workflows
      • storing very large provenance traces that accumulate over time
      • efficient query over large traces
      • presentation of query answers
      • semantic-based query answering over annotated traces
    • to be released in Sept. 2009
  • 68. Querying provenance traces
    • Lineage queries involve traversing a provenance graph from bottom to top
    • [Figure: provenance graph, with labels [ p1, ....] and [ g1, ....]]
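A lineage query of this kind can be sketched as a breadth-first walk from an output back to the values it depends on. The `derived_from` map and the node names are hypothetical; a real trace would be far larger.

```python
from collections import deque

# Hypothetical dependency map: each value lists the values it was directly computed from.
derived_from = {
    "p1": ["paths_g1", "paths_g2"],
    "paths_g1": ["g1"],
    "paths_g2": ["g2"],
}

def lineage(node, derived_from):
    """Return every upstream value of node, in breadth-first order."""
    seen = {node}
    queue = deque([node])
    order = []
    while queue:
        current = queue.popleft()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

print(lineage("p1", derived_from))
```

The cost of the walk grows with the size of the trace, which is the concern raised on the next slide.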
  • 69. Naive provenance trace queries
    • In most approaches, the originating processes are not used for querying
      • consequence: the query requires a traversal of the provenance graph
        • large traces -> computationally expensive
        • view materialization is used in practice to get around the computational cost
    • Z. Bao, S. Cohen-Boulakia, S. Davidson, A. Eyal, S. Khanna, Differencing Provenance in Scientific Workflows, Procs. ICDE, 2009
  • 70. Querying provenance graphs in Taverna
    • Users are rarely interested in the complete provenance graph
      • noisy, possibly large, difficult to navigate
    • This results in a more efficient lineage query algorithm that scales to large provenance graphs
    • Example:
      • BACKTRACE (paths_per_gene[3,4], paths_per_gene[1,2]) AT get_pathway_by_genes AND commonPathways[1] AT TOP
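The idea of reporting only the part of the lineage that the user cares about can be sketched as follows. This is an assumption-laden illustration: the trace layout, the `backtrace` helper and the `focus` parameter are hypothetical, the sketch does not parse the query syntax shown above, and it still walks the whole trace, so it is not the more efficient algorithm the slide refers to.

```python
# Hypothetical trace: record id -> (processor, variable, ids of the records it depends on).
trace = {
    "v5": ("TOP", "commonPathways", ["v3", "v4"]),
    "v3": ("get_pathway_by_genes", "paths_per_gene", ["v1"]),
    "v4": ("get_pathway_by_genes", "paths_per_gene", ["v2"]),
    "v1": ("TOP", "geneIDs", []),
    "v2": ("TOP", "geneIDs", []),
}

def backtrace(start, trace, focus):
    """Walk upstream from start and report only records produced at a processor in focus."""
    hits = []
    stack = [start]
    seen = {start}
    while stack:
        current = stack.pop()
        processor, variable, parents = trace[current]
        if current != start and processor in focus:
            hits.append((current, processor, variable))
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(hits)

print(backtrace("v5", trace, {"get_pathway_by_genes"}))
```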
  • 71. Provenance management architecture
    • [Figure: architecture diagram. Components: Taverna runtime, process design, provenance capture component, relational data model, lineage query processor, Provenance DB, results browser, results analysis, provenance browser. Data labels: workflow inputs, workflow results]
    • provenance events
      • input arrived
      • service invoked
      • output produced
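Storing the three event types in a relational model could look roughly like this. The `event` table, its columns and the sample rows are hypothetical and only illustrate the idea; they are not the schema used by Taverna.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical single-table schema for the three event types named on the slide.
conn.execute("""
    CREATE TABLE event (
        id      INTEGER PRIMARY KEY,
        run_id  INTEGER NOT NULL,
        kind    TEXT NOT NULL
                CHECK (kind IN ('input arrived', 'service invoked', 'output produced')),
        service TEXT NOT NULL,
        port    TEXT,
        value   TEXT
    )
""")

events = [
    (32, "input arrived", "get_pathways_by_genes", "genes", "g1"),
    (32, "service invoked", "get_pathways_by_genes", None, None),
    (32, "output produced", "get_pathways_by_genes", "paths", "p1"),
    (33, "service invoked", "get_pathways_by_genes", None, None),
]
conn.executemany(
    "INSERT INTO event (run_id, kind, service, port, value) VALUES (?, ?, ?, ?, ?)",
    events,
)

# How often has each service been executed, across all runs?
rows = conn.execute(
    "SELECT service, COUNT(*) FROM event "
    "WHERE kind = 'service invoked' GROUP BY service ORDER BY service"
).fetchall()
print(rows)
```

With events in tables, analytics questions such as execution counts become plain SQL aggregates rather than graph traversals.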
  • 72. Provenance interoperability for open science
    • OPM: the Open Provenance Model
    • [Figure: two Develop, Run, Analyze, Publish cycles, labelled OPM]
  • 73. The Science Lifecycle
    • [Figure: lifecycle diagram, adapted from David De Roure’s slides. Labels: scientists; Graduate Students; Undergraduate Students; experimentation; Data, Metadata, Provenance, Scripts, Workflows, Services, Ontologies, Blogs, ...; Digital Libraries; Next Generation Researchers; Local Web Repositories; Virtual Learning Environment; Technical Reports; Reprints; Peer-Reviewed Journal & Conference Papers; Preprints & Metadata; Certified Experimental Results & Analyses]
  • 74. Finding the Provenance of research outputs across all the systems data transited through
    • [Figure: the Science Lifecycle diagram of slide 73, with the same labels]
  • 75. Provenance Across Applications
    • [Figure labels: Local provenance stores; Application (x5); Provenance Inter-Operability Layer; import from OPM; export to OPM]
    • Adapted from Luc Moreau’s slides: “The Open Provenance Model” (Univ. of Southampton, UK), 2009
  • 76. Illustration
    • Processes “used” artifacts and “generated” artifacts
    • Edge “roles” indicate the function of the artifact with respect to the process (akin to function parameters)
    • Edges and nodes can be typed
    • Causation chain:
    • P was caused by A1 and A2
    • A3 and A4 were caused by P
    • Does it mean that A3 and A4 were caused by A1 and A2?
    • [Figure: process P (type=division) with edges used(divisor), used(dividend), wasGeneratedBy(rest), wasGeneratedBy(quotient) and artifacts A1, A2, A3, A4]
    • From Luc Moreau’s slide set: “The Open Provenance Model” (Univ. of Southampton, UK), 2009
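The division example can be written down as two edge lists. The assignment of roles to A1..A4 below is an assumption made for illustration (the transcript does not say which artifact plays which role), and `candidate_dependencies` is a hypothetical helper. It only lists output/input pairs that share a process, which is what the question on the slide is about; it does not assert that the outputs were derived from the inputs.

```python
# Edges point from effect to cause. Roles per artifact are assumed, not taken from the slide.
used = [
    ("P", "A1", "divisor"),
    ("P", "A2", "dividend"),
]
was_generated_by = [
    ("A3", "P", "rest"),
    ("A4", "P", "quotient"),
]

def candidate_dependencies(used, was_generated_by):
    """Pairs (output, input) linked through a common process: possible, not asserted, dependencies."""
    return sorted(
        (output, source)
        for output, generating_process, _ in was_generated_by
        for using_process, source, _ in used
        if generating_process == using_process
    )

pairs = candidate_dependencies(used, was_generated_by)
print(pairs)
```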
  • 77. Integrated OPM generation in Taverna
  • 78. More information on OPM
    • The OPM wiki:
      • http://twiki.ipaw.info/bin/view/OPM/
        • open to discussions and contributions
        • please read the governance doc
    • The 3rd provenance challenge:
      • produce and export OPM graphs
        • interoperable XML and RDF serializations
      • import and query third party graphs
      • http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
      • The University of Manchester’s contribution to the challenge:
        • http://twiki.ipaw.info/bin/view/Challenge/UoM
      • latest meeting held in June, 2009 (Amsterdam)
  • 79. From knowledge capture to exploitation
    • answering user questions effectively
    • (using provenance + semantics infrastructure)
      • has a similar investigation been undertaken before? when, by whom? what was the outcome?
      • have alternative services been used? to what effect?
      • what have been the users’ decisions, and why?
    • enabling collaborative science:
      • provide users with recommendations on the next steps in their session, based on analysis of their past actions;
      • cluster users within a group based on their common interests, observed through the choices they make during the sessions
      • promote social/scientific networking
        • a “blog the lab” flavour
  • 80. The Provenir ontology
    • Upper ontology with support for domain-specific extensions
    • OWL, designed for reasoning and RDF queries
    • Satya S. Sahoo, Roger S. Barga, Jonathan Goldstein, Amit P. Sheth, “Where did you come from...Where did you go?” An Algebra and RDF Query Engine for Provenance, TR-2009-03, Kno.e.sis Center, CSE Dept., Wright State University, Dayton, OH, March 2009
  • 81. Upcoming events
    • SWPM 2009: The First International Workshop on the Role of Semantic Web in Provenance Management
      • http://wiki.knoesis.org/index.php/SWPM-2009
      • Co-located with ISWC'09, October 25/26 2009, Washington D.C., USA
      • Submission Deadline: Friday, July 31, 2009
    • Special issue of Future Generation Computer Systems Journal (FGCS) on the third provenance challenge (to be announced)
      • expected deadline: Dec., 2009
  • 82. Summary: Support for collaborative science
    • Provenance:
    • automated metadata collection and processing
    • data management angle: “logs with a proper data model”
    • storage and query issues
    • interoperability
    • social angle: attribution chain for experimental artifacts
    • processes, data, annotations
    • myExperiment packs -> Research Objects
  • 83. Selected literature on provenance
    • P. Buneman, S. Khanna, W.-C. Tan, Why and Where: A Characterization of Data Provenance, Procs. ICDT, 2001
    • S. B. Davidson, J. Freire, Provenance and scientific workflows: challenges and opportunities, Procs. SIGMOD, 2008
    • Z. Bao, S. Cohen-Boulakia, S. Davidson, A. Eyal, S. Khanna, Differencing Provenance in Scientific Workflows, Procs. ICDE, 2009
    • L. Moreau, P. Groth, S. Miles, J. Vazquez-Salceda, J. Ibbotson, S. Jiang, S. Munroe, O. Rana, A. Schreiber, V. Tan, L. Varga, The provenance of electronic data, Communications of the ACM, Vol. 51 No. 4, Pages 52-58, 2008
    • P. Missier, K. Belhajjame, J. Zhao, C. Goble, Data lineage model for Taverna workflows with lightweight annotation requirements, Procs. IPAW, 2008
    • M. Anand, S. Bowers, T. McPhillips, B. Ludaescher, Efficient Provenance Storage over Nested Data Collections, Procs. EDBT, 2009
    • J. Zhao, C. Goble, R. Stevens, D. Turi, Mining Taverna's semantic web of provenance, Concurrency and Computation: Practice and Experience, Vol. 20 No. 5, 2008
    • R. S. Barga, L. A. Digiampietri, Automatic capture and efficient storage of e-Science experiment provenance, Concurrency and Computation: Practice and Experience, Vol. 20 No. 8, 2008