DCC Keynote 2007
A keynote given on experiences in curating workflows and web services.

3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge"
11-13 December 2007
Renaissance Hotel
Washington DC, USA

Published in: Technology, Education

  • Transcript

    • 1. Curating Services and Workflows: The Good, the Bad and the Ugly. A Personal Story in the Small. Professor Carole Goble, The University of Manchester, UK. Keynote: 3rd International Digital Curation Conference, Washington DC, 11-13 December 2007
    • 2.  
    • 4. Programmatic Interfaces to Services (Web Services not Web Sites). Diagram adapted from Lincoln Stein, with labels: Your Script; Your Workflow; Your Application; Service Registry; Web Service; SeqFetch Service; BLAT Service; BLAST Service; GO Service; Interface Description Document (WSDL, WADL). European Bioinformatics Institute API submissions have risen to 3,166,901 for 2007 (Sarah Hunter).
    • 5. [Mark Wilkinson, 2006]
    • 6.
      • Workflows describe the scientist's in silico experiment
        • Link together and cross reference data in different repositories
        • Mechanism for interoperating.
        • And that includes publications!
      • Remote, third party, external applications and services
        • Accessible to the workflow machinery
        • And that includes data and publications!
      • Results management
        • Semantic metadata annotation of data
        • Provenance tracking of results
      • Sharing and replicating know-how
        • Reuse of workflows
      Viva la Workflows!
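
The slide's points about linking services and tracking the provenance of results can be sketched in a few lines. This is a minimal illustration only, not how Taverna works: the `Workflow` and `ProvenanceRecord` classes and the two stand-in services (`fetch_sequence`, `to_upper`) are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ProvenanceRecord:
    """One line of the trace: which step ran, on what, producing what."""
    step: str
    consumed: Any
    produced: Any


@dataclass
class Workflow:
    """A linear workflow whose steps stand in for remote services."""
    steps: List[Tuple[str, Callable[[Any], Any]]] = field(default_factory=list)

    def add(self, name: str, service: Callable[[Any], Any]) -> "Workflow":
        self.steps.append((name, service))
        return self

    def run(self, data: Any) -> Tuple[Any, List[ProvenanceRecord]]:
        trace = []
        for name, service in self.steps:
            result = service(data)
            # Record what each step consumed and produced as it runs.
            trace.append(ProvenanceRecord(name, data, result))
            data = result
        return data, trace


# Hypothetical local stand-ins for third-party services.
def fetch_sequence(accession: str) -> str:
    return {"X1": "atgcc"}.get(accession, "")


def to_upper(sequence: str) -> str:
    return sequence.upper()


workflow = Workflow().add("fetch_sequence", fetch_sequence).add("to_upper", to_upper)
result, trace = workflow.run("X1")
print(result)
for record in trace:
    print(record.step, record.consumed, "->", record.produced)
```

Because the trace is produced as a side effect of running the workflow, the same definition can be rerun by someone else and the two traces compared.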
    • 7. myGrid Taverna Workflow Workbench
    • 8.
      • 41000+ downloads
      • 40 per day since June 2006.
      • Ranked 210 for SourceForge activity (06/06/07)
      • Open Source Development
      • Used throughout the world
      • Systems biology – SysMo Consortium
      • Proteomics
      • Gene/protein annotation, Microarray data analysis, Medical image analysis
      • Heart simulations, High throughput screening, Phenotypical studies, Phylogeny
      • Plants, Mouse, Human
      • Astronomy, Music, Geography
      • Text mining
      • And Curation….
    • 9. Because software needs curating too. Manchester, Southampton, Edinburgh, European Bioinformatics Institute
    • 10. Automated Curation using Workflows
      • Coordinating data mirroring refreshes
      • Refreshing Data warehouses
        • e-Fungi, ISPIDER
      • Rebuilding lost databases
        • tGRAP: when it collapsed, it was picked up by Nijmegen and rebuilt using workflows over two days.
      • Text mining
        • Very, very popular.
      • Workflows instead of data curation?
        • Data regenerated on demand.
        • Curate the workflow and not the data?
      Bas Vroling, Gert Vriend, CMBI NCMLS UMC Nijmegen
    • 11. Workflows are reading publications. Workflows are processing the data. Workflows are part of curation pipelines. Workflows are another form of outcome to publish and curate alongside data and publications.
    • 12. Workflows are….
      • … provenance of data
      • … a general technique for describing and enacting a process, like a script or a protocol or a method
      • … precise, unambiguous and transparent protocols and records.
      • … often complex, so they need explaining.
      • … often challenging and expensive to develop.
      • … know-how and best practice.
      • … collaborations.
      • … valuable first class scientific assets in their own right.
      • Services are steps in the workflow, and a workflow can be deployed as a service. They are “Social Networks” of services. More on this later….
    • 13.
      • “We need to curate methods as well as data. With the new large scale data sets process matters as much as content and we are rubbish at curating, capturing and reusing it. Much of what we now rely on is processed, not raw data. We have strategies for curating the raw data - indeed multiple standards.
      • Thus, in life sciences we have a gaping void in our curation. We need standards, need places to put methods, and places to allow re-use.”
      Professor Andy Brass, Bioinformatics
    • 14. Towards Reproducible Science (with Reproducible Scientific Objects)
    • 15. Trypanosomiasis in Cattle
      • Identified a pathway whose correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance.
      • Systematic and comprehensive automation. Elimination of user bias.
      Fisher P et al. A systematic strategy for large-scale analysis of genotype–phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Research, 2007, 1–9. A PhD student: Paul Fisher.
    • 16. Recycling, Reuse, Repurposing
      • A Trypanosomiasis in Cattle workflow (by Paul) reused without change for Trichuris muris Infection (by Jo).
      • Identified the biological pathways believed to be involved in the ability of mice to expel the parasite.
      • Workflows are memes. Scientific commodities. To be exchanged and traded and vetted and mashed. Users add value.
    • 17. Scientific memes. Scientific viruses. Increasing numbers: Kepler, Triana, BPEL, Ptolemy II.
    • 18. Aerospace Engine Design. 90% of design is variant design. 70% of information is taken from previous designs. Source: Silvia Wong, University of Southampton, UK
    • 19. Diagram adapted from the eBank project, with labels: Digital Library; Graduate Students; Undergraduate Students; e-Experimentation; e-Scientists; Certified Experimental Results & Analyses; Data, Metadata & Ontologies; Workflows; Institutional Archive; Local Web; Publisher Holdings; Virtual Learning Environment; Technical Reports; Reprints; Peer-Reviewed Journal & Conference Papers; Preprints & Metadata.
    • 20. If I had (well) curated services and workflows I could….
      • Browse around and see what is out there and stop reinventing the wheel.
      • Find a service based on what it does (or was meant to do), what it consumes as inputs and produces as outputs, and what it uses, or because it matches (somehow) something I have already.
      • Understand how it works and when it works
      • Know where there are exact copies or similar services I can use as alternates
      • Know whether I have permission to use it, or have the set up to use it.
    • 21. If I had (well) curated services and workflows I could….
      • Understand how to operate it, configure it correctly with some examples and defaults, invoke it and handle all the error stuff, and predict performance properties
      • Know how expensive it might be to use (financially or performance)
      • Know when and by whom it was created, see its version history and track its versions
      • Know what other people think of it, how popular it is and who else uses it and how
      • Know how reliable it is, whether it still works and whether it keeps changing.
    • 22. If I had (well) curated services and workflows I could….
      • Get intelligent help with using it in my application, like when building workflows
      • Validate it
      • Know how it can be chained with others
      • Find services that can mediate the mismatches between other services.
      • Automagically match it up with others to automagically create new ones
      • Call it from an application or a web browser
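
Two of the wishes above, knowing how a service can be chained with others and finding a service that mediates a mismatch, can be sketched as a simple type check. This assumes each service has been annotated with one semantic input type and one output type; the `Service` tuple, the type labels and the service names are hypothetical, not taken from any real registry.

```python
from typing import List, NamedTuple, Optional


class Service(NamedTuple):
    name: str
    consumes: str  # semantic type of the input (assumed to be curator-supplied)
    produces: str  # semantic type of the output


def can_chain(upstream: Service, downstream: Service) -> bool:
    """Two services chain directly when the output type matches the input type."""
    return upstream.produces == downstream.consumes


def find_shim(upstream: Service, downstream: Service,
              shims: List[Service]) -> Optional[Service]:
    """Return the first shim that converts upstream's output into downstream's input."""
    for shim in shims:
        if shim.consumes == upstream.produces and shim.produces == downstream.consumes:
            return shim
    return None


# Hypothetical services and shim, for illustration only.
fetch = Service("fetch_record", "accession", "genbank_record")
align = Service("align", "fasta_sequence", "alignment")
shims = [Service("genbank_to_fasta", "genbank_record", "fasta_sequence")]

print(can_chain(fetch, align))
print(find_shim(fetch, align, shims))
```

Exact string equality is the crudest possible match; an ontology would also allow a subtype to be accepted where its parent type is expected.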
    • 23. A definition for me [based on wikipedia]
      • Digital curation is about maintaining and adding value to a trusted body of digital assets for current and future use by, and on behalf of, a community.
      • It is a long term process where those assets are managed, cleaned up and corrected, associated with metadata, annotated and discussed, and appropriately preserved or reliably disposed of.
      • Assets are used, we hope
        • By applications and scientists who had anticipated using them.
        • By applications and scientists that had not, or in ways that were unanticipated.
    • 24. e-Scientists in the Cloud
      • Individual life scientists, in under-resourced labs, using other people’s applications, with little systems support.
      • Consumers are providers.
      • Exploratory.
      • A distributed, disconnected community of scientists.
    • 25. Hypo Science © Virtual Laboratories Science in the Small by the Many © Peter Murray-Rust
    • 26. Global Services in the Cloud
      • Independent third party world-wide service providers of applications, tools and data sets. In the Cloud. Hosted at the originator's site.
      • Local applications, tools and datasets. My copies of third party services.
      • Special shim services.
      • Decoupled providers and consumers.
      • 3500 service operations
    • 27. But Surely ….
      • … Can’t I just Google (or Woogle) for a service?
      • The clustalw program from Emboss is called ‘emma’
      • … Can’t I look at its WSDL document?
      • Input0:string, Output0: string
      • What does SeqRet actually do?
      • Liberal use of polymorphic capabilities
      • What about the ones that are not Web Services?
      • … Can’t I look at its documentation?
      • Ahem.  We have to try them to find out what they do…
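
The problem with `Input0:string, Output0:string` can be made concrete with a small sketch. It assumes a toy registry where every service declares the same string-to-string signature, so searching by signature returns everything, while a curator-supplied task annotation narrows the result. The `ServiceDescription` class and the task wording are illustrative assumptions, not real registry content.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceDescription:
    name: str
    input_type: str   # what the interface document declares
    output_type: str
    task: str         # added by a curator; not present in the interface document


# Hypothetical entries; the task descriptions are illustrative only.
REGISTRY = [
    ServiceDescription("emma", "string", "string", "multiple sequence alignment"),
    ServiceDescription("seqret", "string", "string", "read and rewrite sequences"),
    ServiceDescription("blast", "string", "string", "sequence similarity search"),
]


def find_by_signature(input_type: str, output_type: str) -> List[str]:
    """Search on the declared types alone."""
    return [s.name for s in REGISTRY
            if (s.input_type, s.output_type) == (input_type, output_type)]


def find_by_task(keyword: str) -> List[str]:
    """Search on the curated task annotation."""
    return [s.name for s in REGISTRY if keyword.lower() in s.task.lower()]


print(find_by_signature("string", "string"))  # every service matches
print(find_by_task("alignment"))              # only the annotated aligner matches
```

Note that searching by name would not help either: nothing in the name `emma` suggests alignment.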
    • 28. Writing Reusable stuff is HARD
        • Predicting the unknown required by the unknown.
      • Services in the Wild are frequently Rubbish.
      • Scientists and Developers are naughty.
    • 29. Applications and Scientists need a Curated Registry of Services. Note: Registry, not repository. Services are hosted elsewhere. (Just having a workflow system isn’t enough.)
    • 30. Service Curation
      • 3500+ service operations
      • 600+ annotated by full-time curator.
      • myGrid Ontology
      • Annotation and curation pipeline
      • Curation tools
      • Feta and Find-O-Matic discovery tools
      • There are others:
        • DAS Registry
        • BioMOBY Central
      Since 2002
    • 31. Building Annotation Commodities. Object: Service, Endpoint, Workflow, file, etc. Annotation Model: Functional, Operational, Provenance, Reputation. Descriptions: Ontologies, Controlled vocabulary, Tags, Folksonomy, Free text. Layered, Enrichment, Augmentation. Annotation model uses Semantic Web technologies - OWL and RDFS. The perspective of the scientist. Managed, centralised curation process. 700+ class domain ontology. Service Ontology. 3500+ Services.
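
As a rough sketch of the annotation model on this slide, the four facets and the layered description forms could be represented as below. The myGrid model itself uses OWL and RDFS; this plain-Python version, with its class names, fields and example values, is an assumption made only to show how annotations of different kinds can accumulate on one object.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Facet(Enum):
    FUNCTIONAL = "functional"
    OPERATIONAL = "operational"
    PROVENANCE = "provenance"
    REPUTATION = "reputation"


class Form(Enum):
    ONTOLOGY_TERM = "ontology term"
    CONTROLLED_VOCABULARY = "controlled vocabulary"
    TAG = "tag"
    FREE_TEXT = "free text"


@dataclass(frozen=True)
class Annotation:
    facet: Facet
    form: Form
    value: str
    author: str


@dataclass
class AnnotatedObject:
    kind: str          # e.g. "service", "endpoint", "workflow", "file"
    identifier: str
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, annotation: Annotation) -> None:
        # Annotations are layered: new ones are added, old ones are kept.
        self.annotations.append(annotation)

    def by_facet(self, facet: Facet) -> List[Annotation]:
        return [a for a in self.annotations if a.facet == facet]


# Hypothetical object and annotations.
service = AnnotatedObject("service", "example:aligner")
service.annotate(Annotation(Facet.FUNCTIONAL, Form.TAG, "alignment", "a_user"))
service.annotate(Annotation(Facet.FUNCTIONAL, Form.ONTOLOGY_TERM, "performs_task:alignment", "curator"))
service.annotate(Annotation(Facet.REPUTATION, Form.FREE_TEXT, "worked for me", "a_user"))
print(len(service.by_facet(Facet.FUNCTIONAL)))
```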
    • 32. Volatility and Decay
      • Services are not deposited and preserved.
      • They are referred to.
      • Constant, silent churn and flux.
      • No SLA to be stable or standard.
      • Constantly need tending or else they go bad and stale.
        • SeqHound, BioMART API
      • Rapid metadata heart-beat, especially on operational metadata. Like minutes.
      • (cf. IVOA service validation, DAS).
      • Workflow decay
      • Not Fix, File, Forget
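
The "rapid metadata heart-beat" could look something like the following sketch: a monitor that probes each referenced service and keeps a short-lived operational record. The probe is injected as a function so the example runs without a network; `HeartbeatMonitor`, `OperationalStatus`, the five-minute staleness window and the example URLs are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class OperationalStatus:
    last_checked: float
    alive: bool
    consecutive_failures: int


class HeartbeatMonitor:
    """Keeps operational metadata fresh by re-probing referenced services."""

    def __init__(self, probe: Callable[[str], bool], stale_after: float = 300.0):
        self.probe = probe              # returns True when the endpoint responds
        self.stale_after = stale_after  # seconds before a record counts as stale
        self.status: Dict[str, OperationalStatus] = {}

    def check(self, endpoint: str, now: float) -> OperationalStatus:
        previous = self.status.get(endpoint)
        try:
            alive = bool(self.probe(endpoint))
        except Exception:
            alive = False  # a probe that raises is treated as a failed check
        if alive:
            failures = 0
        else:
            failures = previous.consecutive_failures + 1 if previous else 1
        self.status[endpoint] = OperationalStatus(now, alive, failures)
        return self.status[endpoint]

    def is_stale(self, endpoint: str, now: float) -> bool:
        record = self.status.get(endpoint)
        return record is None or now - record.last_checked > self.stale_after


# Hypothetical probe: pretend only URLs ending in "/ok" respond.
monitor = HeartbeatMonitor(probe=lambda url: url.endswith("/ok"))
monitor.check("http://example.org/ok", now=0.0)
monitor.check("http://example.org/gone", now=0.0)
monitor.check("http://example.org/gone", now=60.0)
print(monitor.status["http://example.org/gone"].consecutive_failures)
print(monitor.is_stale("http://example.org/ok", now=600.0))
```

In a real deployment the probe would be an actual service call and the clock would come from `time.time()`; both are passed in here to keep the sketch testable.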
    • 33. One size does not fit all…
      • Scientist - Finding
        • Simple classifications on a few properties. Smart tools. “Coarse grained”. Simple Ontology.
        • Decision Support
      • Automation – Validation and Execution “fine grained”
        • Rich metadata for automatic service configuration, invocation, debugging, repair, automated composition
        • Decision making.
    • 34. Diagram relating curation investment to value. Axis labels: Increasing value; Increased automation; Better understanding; Investment (cost, effort); Folksonomy Tagging; Ontology Curation. Example annotations: ‘A tool to compare multiple protein structures’; performs_task : alignment; input_type{seq_a} : sequence…; output_type{score} : d_value; output{score} is_distance_between pair {input{sequence a}, input{sequence b}}. Capabilities shown: Manual use of tools, web pages; Naïve workflow systems; Scripted tool invocation; Basic ‘discovery’ style service annotations; Service Configuration; Guided workflow construction; Guided workflow reuse; Workflow validation; Knowledge driven visualization; Semantically enriched data; Dynamic Service Substitution; Automated Workflow Construction.
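
One of the capabilities named on this slide, dynamic service substitution, becomes possible once services carry annotations like `performs_task`, `input_type` and `output_type`. The sketch below assumes those three annotations exist for every service and treats two services as interchangeable when all three match; the `AnnotatedService` class and the catalogue entries are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AnnotatedService:
    name: str
    performs_task: str
    input_types: Tuple[str, ...]
    output_type: str
    available: bool = True


def substitutes(failed: AnnotatedService,
                catalogue: List[AnnotatedService]) -> List[AnnotatedService]:
    """Available services whose task and input/output annotations match the failed one."""
    return [
        s for s in catalogue
        if s.name != failed.name
        and s.available
        and s.performs_task == failed.performs_task
        and s.input_types == failed.input_types
        and s.output_type == failed.output_type
    ]


# Hypothetical catalogue entries.
primary = AnnotatedService("aligner_a", "alignment", ("sequence", "sequence"), "d_value", available=False)
catalogue = [
    primary,
    AnnotatedService("aligner_b", "alignment", ("sequence", "sequence"), "d_value"),
    AnnotatedService("aligner_c", "alignment", ("sequence", "sequence"), "d_value", available=False),
    AnnotatedService("translator", "translation", ("sequence",), "sequence"),
]
print([s.name for s in substitutes(primary, catalogue)])
```

Matching annotations only says two services claim to do the same thing. Whether they give comparable results is a separate question that the annotations cannot answer.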
    • 35. Progressive Curation. Just enough, Just in time. Jam today and Jam tomorrow. Chart labels: Gain; Pain; Very BAD; Good, but Unlikely; Just right.
    • 36. Applications and Scientists needed a Curated Repository of Workflows. Find a workflow like this one that I can edit to do something else. That’s really hard.
    • 37. Workflow Glass Boxes
      • Social Networks of Services
        • Is it dependent on a service I don’t have access to, or one that is deprecated or unreliable?
      • Nesting and fragments of workflows
        • Workflow networks
      • Service Diagnostics
        • Popularity, Co-use and clustering
        • Quality of Service
      • Service Curation
        • Automate service annotation
        • Debug service annotations
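
Treating a workflow as a glass box means its service dependencies, including those hidden inside nested workflows, can be listed and checked. The following sketch assumes workflows are stored as lists of steps that refer either to a service or to another workflow, and that the registry holds a set of warning flags per service. The data layout and all names are hypothetical, and the recursion assumes workflows do not nest cyclically.

```python
from typing import Dict, List, Set


def services_used(name: str, library: Dict[str, List[dict]]) -> Set[str]:
    """Collect every service a workflow depends on, descending into nested workflows."""
    found: Set[str] = set()
    for step in library[name]:
        if step["kind"] == "workflow":
            found |= services_used(step["ref"], library)
        else:
            found.add(step["ref"])
    return found


def risky_dependencies(name: str, library: Dict[str, List[dict]],
                       status: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each problematic service to its flags; unregistered services count as unknown."""
    report = {}
    for service in sorted(services_used(name, library)):
        flags = status.get(service, {"unknown"})
        if flags:
            report[service] = sorted(flags)
    return report


# Hypothetical workflow library and registry status.
LIBRARY = {
    "annotate_genes": [
        {"kind": "service", "ref": "seq_fetch"},
        {"kind": "workflow", "ref": "align_and_score"},
    ],
    "align_and_score": [
        {"kind": "service", "ref": "aligner"},
        {"kind": "service", "ref": "scorer"},
    ],
}
STATUS = {"seq_fetch": set(), "aligner": {"deprecated"}}

print(risky_dependencies("annotate_genes", LIBRARY, STATUS))
```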
    • 38. Curation Sweatshop. Our hard-working (real) curators: notice how tired they look.
      • Steady increase in numbers of services and workflows
      • Time-consuming and expensive.
      • Annotation and the Ontologies
      • Choosing, Adding value. Monitoring.
      • Should we instead enable suppliers to add value?
      Franck Tanoh, Katy Wolstencroft
    • 39. Automated Curation
      • Operational:
        • Monitoring information services, dial home diagnostics from applications, customer reports
      • Reputation and Provenance:
        • Recommendations and ratings
      • Functional:
        • Text mining and parsing files and documents (if any)
        • Incidental metadata through use.
        • Annotation derivation from sound workflows and rich service descriptions of inputs and outputs
        • Not perfect, but a help!
      Needs lots of infrastructure. Needs lots of seeding and reviewing.
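
The idea of deriving annotations from sound workflows can be sketched as follows: if an unannotated service sits between two well-described services in a workflow that is known to work, its input and output types can be proposed from its neighbours. The function name, the data layout and the example services are assumptions, and the results are candidates for a curator to review, in keeping with "not perfect, but a help".

```python
from typing import Dict, List, Set, Tuple


def derive_annotations(
    links: List[Tuple[str, str]],
    known_inputs: Dict[str, str],
    known_outputs: Dict[str, str],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Propose input/output types for unannotated services from their workflow neighbours.

    links holds (upstream, downstream) pairs taken from workflows assumed to be sound.
    """
    derived_inputs: Dict[str, Set[str]] = {}
    derived_outputs: Dict[str, Set[str]] = {}
    for upstream, downstream in links:
        # An unannotated consumer probably accepts what its producer emits.
        if downstream not in known_inputs and upstream in known_outputs:
            derived_inputs.setdefault(downstream, set()).add(known_outputs[upstream])
        # An unannotated producer probably emits what its consumer accepts.
        if upstream not in known_outputs and downstream in known_inputs:
            derived_outputs.setdefault(upstream, set()).add(known_inputs[downstream])
    return derived_inputs, derived_outputs


# Hypothetical workflow links and existing annotations.
links = [("fetch_sequence", "mystery_tool"), ("mystery_tool", "render_tree")]
known_outputs = {"fetch_sequence": "protein_sequence"}
known_inputs = {"render_tree": "phylogenetic_tree"}

proposed_inputs, proposed_outputs = derive_annotations(links, known_inputs, known_outputs)
print(proposed_inputs)
print(proposed_outputs)
```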
    • 40. Local Libraries and Warehouses of Workflows trapped in their enterprises or platforms
    • 41. Tryps Twiki World Wikis are where data lives….
    • 42.
      • Picture of workflow in Flickr – evidence of social tagging and networking
    • 43.  
    • 44. myExperiment is…
      • A bazaar for any and all kinds of workflows.
      • A community social network for community annotation and general gossip.
      • A gateway to other publishing environments.
      • A federated repository.
      • Publish self-describing encapsulated myExperiment Objects.
      • Not workflows; Scientific Objects!
      • e-Crystals, Social science, Astronomy, Geography, Music
      • (A platform for launching workflows.)
      Since Feb 2007
    • 45.  
    • 46. Encapsulated myExperiment Objects.
      • A single or collection of workflows with instructions and examples
      • A workflow with its inputs and the products of executing it (including logs), perhaps multiple times
      • Chemistry data from instruments, coupled with blogged log book entries
      • A collection of all the digital items associated with one experiment—including EMOs
      • A reproducible article with workflows and data
      Virtual Exchange Format
    • 47. Encapsulated myExperiment Objects.
      • Open Archives Initiative – Object Reuse and Exchange (OAI-ORE)
        • compound object information and standardised and interoperable mechanisms
      • W3C Open Linked Data Initiative
      • Reproducible Scientific Objects
      Virtual Exchange Format
    • 48. EMO Challenges
      • What happens when the parts are scattered across multiple stores?
      • What happens if someone updates a part?
      • How will my EMO be discovered on the Web?
      • How can I work with an EMO offline?
      • What is the provenance of the EMO and its parts?
      • What happens if a part is unavailable?
      24/5/2007 | myExperiment | Slide
      • How do I send an EMO by email?
      • Can I turn an EMO into a tarball?
      • Can I archive an EMO to a CDROM?
      • If I delete this file will it break anyone’s EMOs?
      • How do I trust an EMO?
      • How do I handle an EMO RESTfully?
      • Can my EMO link to objects outside the EMO?
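
Two of these questions, "Can I turn an EMO into a tarball?" and "What happens if someone updates a part?", suggest a simple packaging approach: bundle the parts with a manifest of checksums so that a changed or missing part can be detected later. This is only a sketch under that assumption, not the myExperiment or OAI-ORE format; `build_emo`, `verify_emo` and the `manifest.json` name are hypothetical, and a part that is itself called `manifest.json` would be overwritten.

```python
import hashlib
import io
import json
import tarfile
from typing import Dict


def build_emo(parts: Dict[str, bytes]) -> bytes:
    """Pack parts (name -> bytes) plus a manifest of SHA-256 digests into a gzipped tarball."""
    manifest = {name: hashlib.sha256(data).hexdigest() for name, data in parts.items()}
    entries = dict(parts)
    entries["manifest.json"] = json.dumps(manifest, sort_keys=True).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def verify_emo(blob: bytes) -> Dict[str, bool]:
    """For each manifest entry, report whether the part is present and unaltered."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        contents = {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }
    manifest = json.loads(contents["manifest.json"])
    return {
        name: name in contents and hashlib.sha256(contents[name]).hexdigest() == digest
        for name, digest in manifest.items()
    }


# Hypothetical parts of one object.
blob = build_emo({"workflow.xml": b"<workflow/>", "inputs.txt": b"X1\n"})
print(verify_emo(blob))
```

A tarball answers the email and CD-ROM questions but not the ones about parts scattered across stores or links to outside objects; those need references rather than copies.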
    • 49. Not just Workflows, Not just Biology. Chemistry - eCrystals; Social Science; Astronomy; Music. Files and Documents; Logs and Blogs; Ontologies; Data.
    • 50. Why EMO?
    • 51. Respect Cautious Collaboration….
      • Community web site, federated repository.
      • Multiple and My.
      • Publish what I want when I want within the group I want.
      • Mixed identity regimes: an identity authority
      • OAI-PMH.
      • Open Archives Initiative.
      • The CombeChem project.
      cloud, enterprise, personal, laboratory, project
    • 52. A Gateway + more User Participation
      • Tryps team already has a wiki
      • Mash up with Facebook and workflow hosting apps.
      • Bring functionality to the user. Cooperate! Don’t Control.
      The Research Information Centre: British Library and Microsoft. Figure courtesy Savas Parastatidis, Microsoft
    • 53.  
    • 54. Apologies to Larson
    • 55. From me-Science to we-Science
      • Tribal bonding and sharing
      • Crossing Tribal Boundaries
      • Across communities and disciplines (MIT)
      • “Intellectual Fusion” & “Swarming”; breaking down silos
      • Understanding outside my expertise. E.g. sources of error
      • Metadata challenges.
      • Social challenges.
    • 56. A Change in the World. Curation by the Monks; Curation by the Masses; Automated Curation; Curation by Developers (diagram arrows: seed, refine, validate). The WS4LS BioCatalogue Project, Manchester & EBI
    • 57. Challenges - where to start? If we thought about them hard we wouldn’t have done it. So we didn’t. It’s, er, my experiment. National Centre for e-Social Science
    • 58. User Participation for Content and Functionality
      • Adoption depends on lots of shared services and workflows
      • and enabling Scientists to add value through applications and collaborative tagging
      • The Selfish Scientist –
      • e-Science is me-Science
      • Incentive models for Scientists to share?
    • 59.
      • We expect workflow versioning.
      • We encourage workflow evolution by the developers and others.
      • Versions to be re-pooled.
      • Ownership
      • Sharing
      • Permissions
      • Separate update of workflow from update of metadata.
      Workflow Versioning and Sharing
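
The last bullet, separating updates of the workflow from updates of its metadata, can be sketched as a record with two independent histories. This is an illustrative assumption about how such a separation might be modelled, not myExperiment's data model; `WorkflowEntry` and its fields are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class WorkflowEntry:
    owner: str
    # Append-only history of (author, definition) pairs: the workflow itself.
    versions: List[Tuple[str, str]] = field(default_factory=list)
    # Title, description, tags and so on: edited without creating a new version.
    metadata: Dict[str, str] = field(default_factory=dict)
    metadata_revision: int = 0

    def publish(self, definition: str, author: str) -> int:
        """Add a new workflow version and return its 1-based version number."""
        self.versions.append((author, definition))
        return len(self.versions)

    def update_metadata(self, **changes: str) -> None:
        """Change the descriptive metadata only; the version history is untouched."""
        self.metadata.update(changes)
        self.metadata_revision += 1


# Hypothetical usage.
entry = WorkflowEntry(owner="paul")
first = entry.publish("<workflow v1/>", author="paul")
entry.update_metadata(title="Genotype to phenotype", tags="trypanosomiasis")
second = entry.publish("<workflow v2/>", author="jo")
print(first, second, entry.metadata_revision)
```

Recording the author with each version lets others evolve the workflow while ownership of the entry stays with the original developer.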
    • 60.
      • Control in the hands of the developers.
      • Is this flexible enough?
      • Sense of Ownership. IP. Authorship attribution. Copyright.
      • Provenance propagation.
      • Validation, Safety, Trust.
      • When does a workflow get changed so much that it’s no longer the same workflow?
      Workflow Versioning and Sharing
    • 61. More Challenges
      • Privacy, Copyright, IP
      • Incentives to share, collaboratively curate and behave.
        • Altruism, mischief, self-interest
        • Credit, reputation, fame, impact. Me-Science.
        • Expectations – suppose it’s wrong? Will I get sued?
        • Scientists are naughty too.
      • Quality control.
        • Palpability, buyer beware, memes are tricky things. Community Trust models. Policing. Auto-checking? Shaming?
      • Sustainability leverages
        • The Open Source Development Model
        • On young people’s endless enthusiasm to share.
      • Better tooling.
    • 62. Keep your Users Close. Web 2.0 Style development
      • Perpetual Beta
      • Users Add Value
      Parties, HackFests, Advocates, Guinea Pigs
    • 63. Do we still need curators? “Hell is other people’s metadata”
    • 64. Yes!
      • Open tagging, folksonomies, blogging, profiles, recommendations, Social network analysis and e-tracking, workflow analytics.
      • Deafened by the Shouting
      • Overseeing but not Controlling. Review and add value.
      • Tagging -> Structured Pipeline
      • Reconcile Creative Freewheeling with need to Organise.
        • Impedance mismatch between research activities and the recording of research data. Dynamic Scientists vs Prescriptive Platform
      • Ontology dictatorship.
        • Reconciling managed ontologies with emergent folksonomies. Encourage Tagging with Ontologies.
      • Metadata Creep: multi-form, multiple-descriptions
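
Reconciling managed ontologies with emergent folksonomies could start with something as simple as the sketch below: free tags are normalised and looked up in a curator-maintained synonym table, and anything unmatched is queued for review rather than rejected. The `reconcile` function, the synonym table and the ontology term names are hypothetical.

```python
from typing import Dict, List, Tuple


def reconcile(tags: List[str], synonyms: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Map free tags onto ontology terms; return (mapped, unmatched).

    mapped uses the normalised tag as key; unmatched keeps the tag as the user typed it.
    """
    mapped: Dict[str, str] = {}
    unmatched: List[str] = []
    for tag in tags:
        key = tag.strip().lower()
        if key in synonyms:
            mapped[key] = synonyms[key]
        else:
            unmatched.append(tag)
    return mapped, unmatched


# Hypothetical synonym table maintained by curators.
SYNONYMS = {
    "blast": "task:sequence_similarity_search",
    "alignment": "task:alignment",
    "aligning": "task:alignment",
}

mapped, unmatched = reconcile(["BLAST ", "aligning", "my-favourite"], SYNONYMS)
print(mapped)
print(unmatched)
```

The unmatched list is where the curator adds value: recurring tags there are candidates for new synonyms or new ontology terms.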
    • 65. Pay as you Go, Emergent Curation. Chart labels: Gain; Pain; Very BAD; Good, but Unlikely; Just right; Folksonomy Tagging; Hard Core Ontology Curation.
    • 66. Must be careful to avoid technology seduction. Computer people want to do interesting stuff; curators want stability and reliability; users want simplicity. Smart tools and good interfaces often outwit clever techniques. Bummer. However….
    • 67. Model Flexibility
      • Semantic Web!
        • Flexibility of RDF
        • Incrementality of OWL
        • Self description
        • Reasoning when needed
        • Open Linked Data, SKOS
      • Open Archives Initiative – Object Reuse and Exchange (OAI-ORE)
        • compound object information and standardised and interoperable mechanisms
    • 68. Metadata Middleware
      • Annotations are First Class Citizens
      • A technology independent metadata abstraction layer. Natively supported by the middleware infrastructure.
      • S-OGSA Framework from the Semantic Grid.
      • Semantic Bindings Management.
    • 69. Curation Design Patterns
      • The Long Tail
      • Data is the Next Intel Inside
      • Users Add Value
      • Network Effects by Default
      • Some Rights Reserved
      • The Perpetual Beta
      • Cooperate, Don't Control
      • Beyond a Single Device
    • 70. SMARTER Curation
      • Selective – ROI
      • Mass community annotation – cooperate don’t control. Harness people cycles and network effects.
      • Automate – Derive. Harness compute cycles and network effects.
      • React – to changes, automate responses
      • Timely – just in time
      • Expedient – just enough
      • Review – seed, oversee & refine rather than control
      • Changes in model support and infrastructure
      • Changes in work practice – if it’s a problem, it’s a people problem.
    • 71. Credits
      • David De Roure
      • Matt Lee
      • David Withers
      • Don Cruickshank
      • Jiten Bhagat
      • David Newman
      • Mark Borkum
      • Danius Michaelides
      • Ed Zaluska
      • Jeremy Frey
      • Simon Coles
      • Marco Roos
      • Rob Procter
      • Alex Voss
      • Duncan Hull
      • Paul Fisher
      • Antoon Goderis
      • Katy Wolstencroft
      • Franck Tanoh
      • Robert Stevens
      • Martin Senger
      • Khalid Belhajjame
      • Andy Brass
      • Norman Paton
      • Rodrigo Lopez (EBI)
      • Tom Oinn (EBI)
      • Pinar Alper, Phil Lord, Chris Wroe
      • Mark Wilkinson (BioMOBY)
      • Savas Parastatidis (Microsoft)
      • Alan Williams, Stuart Owen, June Finch, Stian Soiland
      • Kaixuan Wang, Oscar Corcho
      • And the rest of myGrid and OntoGrid
    • 72. For More Information
      • myExperiment:
        • David De Roure
      • myGrid: Taverna and WS4LS Catalogue
      • SoapLab:
      • OntoGrid: Semantic middleware