Data sharing - Data management - The SysMO-SEEK Story



Professor Carole Goble, University of Manchester, talks at the RIN "Research data: policies & behaviour" event as part of a series on Research Information in Transition.

  • Learn about JISC’s work in the area of shared services for STEM subjects, particularly the JANET network service and virtual research environments (i.e., web tools for helping research processes)
    Explore new opportunities for research being opened up via shared services, and also the economic savings this creates
    Consider the role their university might play in providing a shared service to other institutions
  • Not major data centres but the long tail
  • Data pipeline
    Data funnel
    Fuzzy line between collaborators and competitors
    USB drives, wikis, databases,
    Distributed in email etc.
  • Sharing without fear
  • MaDaM project
    Competitive advantage.
    Academic vanity.
    Novel insights.
    Being scooped.
    Not comprehensible
    Competitive advantage.
    Academic vanity.
    Being scooped.
    New Reward Schemes
    But we have to be aware of the drivers for collaboration.
    Competitive advantage.
    Be the first with the Nature paper.
    Academic vanity
    Credit, credibility, fame, acclaim,
    recognition, peer respect, reputation.
    Get my stuff adopted / recognised
    More funding
    Being found out
    Open to rigorous inspection.
    Being scooped
    Beaten by lab X
    Protecting my turf.
    Releasing results too early.
    Getting left behind. Being out of fashion.
    Looking stupid
    Being misinterpreted or misrepresented.
    Looking stupid. Losing control. Taking a risk
  • Some excuses
  • Genomics Standards Consortium
    All or nothing
  • Scripts, workflows, simulations, experimental plans, statistical models…
    Repeatable, reproducible, comparable and reusable research.
    Propagate expertise
    Build reputation.
  • Credit, Citation, Career
    Personal and institutional visibility
    Scholarly citation metrics
  • contribute, curate, review, reuse.
    Data is not respected
    John Quackenbush - Professor of Computational Biology and Bioinformatics - Department of Biostatistics - Harvard School of Public Health.
  • 58% developed by students, 24% stated not maintained
    (Schultheiss et al. (2010) PLoS Comp Biol (in review))
    Tools, commons
    Preparing data for sharing is free like puppies are free
  • National Centre for BioOntologies
    The Open Biological and Biomedical Ontologies
    Standardise messages not structures
    Only as good as your data services
    Minimum models and Controlled vocabularies
  • 58% developed by students, 24% stated not maintained
    (Schultheiss et al. (2010) PLoS Comp Biol (in review))
    Tools, commons
    Preparing data for sharing is free like puppies are free
    DOIs cost
  • Hard core are the PALs
    Commons-based Cleanup
    ● Manual and automated curation workflows ● Curators emergent and assigned ● Curation tools
    Right time right place – also email!
    Third party curation is really hard
    Expert curation
    Added value
    Structured metadata
    Facetted browsing
    Time to get organised
    One example workflow can be found at: This is the old example workflow, but I have tagged it as a benchmark. You can see the breakdown of tags given to this at: ... or by clicking on the breakdown section (see attached image).
    14 curation tags. Some are slightly ambiguous and others have little meaning. These were:
    * test workflow
    * component - part of whole solution
    * whole solution
    * tutorial / example
    * incomplete
    * junk
    * obsolete - deprecated
    * runnable
    * not runnable
    * requires description
    * requires credit / attribution
    * requires example input data
    * description; [Description Text]
    * example data; [port : value]
    Each tag was preceded by a "c:" so that it would be picked up by the myExperiment plugin and could be differentiated from other myExperiment tags.
    If some example data was known, I tried to add it using the example tag "example data; [port : value]", where the port name is given, along with the data to be put into the port.
    The whole process was very time consuming, as I had to try to open each workflow in T2, run it using some example data (or figure out what it did and run it with lots of test data), and then add each comment (checking each workflow on myExperiment to see if it had completed properly).
  • Add url here
  • E-Lab and Taverna – all my software - elephants ---- elephant in the room, blind men and elephants, danger of being white elephants?
    And other e-Science projects
    Each of these applies to all our projects. Just one of them is not enough. Not even for Taverna.
    To sustain it as a service we must sustain the software and the content in its repositories
  • Data sharing - Data management - The SysMO-SEEK Story

    1. Data sharing. Data management. The SysMO-SEEK Story. Professor Carole Goble FREng FBCS CITP, University of Manchester, UK
    2. 13 teams, 91 institutes, 300 scientists. Multi-site, multi-disciplinary. Each three year duration. Data generation. Data consumption. Data analysis. Data management: Local – Shared – Long term. Pan European Systems Biology
    3. Legacy: Own data solutions. wikis, e-Groupware, PHProjekt, BaseCamp, PLONE, Alfresco, bespoke commercial … files and spreadsheets.
        Suspicion: Extreme caution over sharing. Modellers vs experimentalist tribalism.
        Dynamics: Many institutions, many projects, overlapping memberships, changing membership. Projects ending, starting, carrying on the same, carrying on differently.
        Skills: Expert scientists, inexpert informaticians. Few resources.
        Data: Patchy standards, incomparable data, afterthought.
    4. Scientist. Lab. Collaborators. Competitors. Programme. Published. Post-Publication. Pre-Publication.
    5. Data mine-ing: “my impression of researchers, and I can criticize myself in this, is that we’re much more interested in sharing data when we mean sharing somebody else’s as opposed [to] sharing ours.” (E-infrastructure - taking forward the strategy, RIN report, 2010)
    6. Rewards: Competitive advantage. Adoption. Kudos & Credit. Help. Fame. Reputation. Risks: Being scooped. Scrutiny. Misinterpretation. Cost. Blame. Reputation. Nature 461, 145 (10 September 2009). 1. Sharing
    7. “It’s not ready yet” “I need to get (another) publication first” “We don’t have the resources or skills to prepare it for others, esp. now we finished that project” “It’s faster/easier to do it myself, and will keep the credit/control too” “It’s not described enough to be usable” “I don’t trust the quality. It’s not reliable enough. It’s too noisy.” “Others won’t use it properly.” “It’s not worth my while” “They are my competitors!!”
    8. Pseudo Sharing
    9. 2. Preparation for Use: Curation. Standards. Reusability. Reproducibility. Accountability & Quality. Data discipline. Silo busting.
    10. Metadata Minefield
        CIMR: Core Information for Metabolomics Reporting
        MIABE: Minimal Information About a Bioactive Entity
        MIACA: Minimal Information About a Cellular Assay
        MIAME: Minimum Information About a Microarray Experiment
        MIAME/Env: MIAME / Environmental transcriptomic experiment
        MIAME/Nutr: MIAME / Nutrigenomics
        MIAME/Plant: MIAME / Plant transcriptomics
        MIAME/Tox: MIAME / Toxicogenomics
        MIAPA: Minimum Information About a Phylogenetic Analysis
        MIAPAR: Minimum Information About a Protein Affinity Reagent
        MIAPE: Minimum Information About a Proteomics Experiment
        MIARE: Minimum Information About a RNAi Experiment
        MIASE: Minimum Information About a Simulation Experiment
        MIENS: Minimum Information about an ENvironmental Sequence
        MIFlowCyt: Minimum Information for a Flow Cytometry Experiment
        MIGen: Minimum Information about a Genotyping Experiment
        MIGS: Minimum Information about a Genome Sequence
        MIMIx: Minimum Information about a Molecular Interaction Experiment
        MIMPP: Minimal Information for Mouse Phenotyping Procedures
        MINI: Minimum Information about a Neuroscience Investigation
        MINIMESS: Minimal Metagenome Sequence Analysis Standard
        MINSEQE: Minimum Information about a high-throughput SeQuencing Experiment
        MIPFE: Minimal Information for Protein Functional Evaluation
        MIQAS: Minimal Information for QTLs and Association Studies
        MIqPCR: Minimum Information about a quantitative Polymerase Chain Reaction experiment
        MIRIAM: Minimal Information Required In the Annotation of biochemical Models
        MISFISHIE: Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments
        STRENDA: Standards for Reporting Enzymology Data
        TBC: Tox Biology Checklist
        BioPAX: Biological Pathways Exchange
        FuGE: Functional Genomics Experiment
        Minimum Information for Biological and Biomedical Investigations
    11. Publishing Process: models, software, methods, scripts, standard operating procedures
    12. Community Curation Responsibility
    13. Blue Collar Science (John Quackenbush). Difficult and time consuming. Poor Credit or Reward. Shabby Career Paths & Prospects.
    14. 3. Credit Crisis • Reward sharing, curation and reuse rather than reinvention. • Credit. Attribution. Citation. • For software, methods and standards too. • Technical ( • Cultural (Respected policy). • Institutional. • Funding bodies.
    15. 4. Infrastructure, Capability & Capacity • Three year PhD/project cycle • Local data control • Realistic paths to adoption by busy people. • Spreadsheets, wikis, catalogues and yellow pages. • Content and Tools
    16. Identity Management: Sharednames, DataCite, LSID, DOIs, ORCID. 5. Data Ecosystem. Resources
    17. 6. Sustained Resources • Three year projects. • Three year lifespan of data (and its software). • Sunsets and Sustains • Reinvention rewarded • Institution. • Funding councils. • Funding panels. • Publishers • Libraries • National data centres • International data centres
    18. Incentives. Sensitivity to Behaviours Infrastructure Community building Trusted service Coordination Governance Policy Capability Community Integration
    19. A Partnership • Software engineers • Computational scientists • Experimental Scientists • Domain informaticians • Service providers • Funding agencies • But the community credit crisis continues….
    20. Summary • Science is a complex social activity undertaken by tribes of people and dominated by trust issues. • Infrastructure has to be there and fit for purpose but it’s not the real problem. • Need a cultural shift (on all sides) that truly honours data.
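
The speaker notes on workflow curation describe a tagging convention: each curation tag is prefixed with "c:" to set it apart from ordinary myExperiment tags, and two tags carry a bracketed payload ("description; [Description Text]" and "example data; [port : value]"). A minimal sketch of how such tags might be parsed is given below. The function and class names are hypothetical, and the exact spacing and escaping the myExperiment plugin accepted is an assumption rather than something the notes specify.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Prefix described in the notes; everything else here is an illustrative assumption.
CURATION_PREFIX = "c:"


class CurationTag(NamedTuple):
    label: str               # e.g. "runnable", "example data"
    payload: Optional[str]   # text inside [...] if the tag carries one


# A label, optionally followed by "; [payload]".
_TAG_RE = re.compile(r"^(?P<label>[^;]+?)\s*(?:;\s*\[(?P<payload>.*)\]\s*)?$")


def parse_curation_tag(tag: str) -> Optional[CurationTag]:
    """Return a CurationTag for "c:"-prefixed tags, or None for anything else."""
    tag = tag.strip()
    if not tag.startswith(CURATION_PREFIX):
        return None  # an ordinary tag, not a curation tag
    body = tag[len(CURATION_PREFIX):].strip()
    match = _TAG_RE.match(body)
    if match is None:
        return None  # malformed curation tag
    payload = match.group("payload")
    return CurationTag(
        label=match.group("label").strip(),
        payload=payload.strip() if payload is not None else None,
    )


def parse_example_data(payload: str) -> Tuple[str, str]:
    """Split an "example data" payload of the form "port : value"."""
    port, _, value = payload.partition(":")
    return port.strip(), value.strip()


if __name__ == "__main__":
    # Hypothetical tag list for a single workflow.
    tags = ["taverna", "c:runnable", "c:example data; [sequence : ACGT]"]
    for raw in tags:
        print(raw, "->", parse_curation_tag(raw))
```

Splitting the payload on the first colon only means a value that itself contains a colon (a URL, for instance) stays intact.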