Being FAIR: FAIR data and model management SSBSS 2017 Summer School

Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PIs' labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training and create a platform for dataset interoperability. As the Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform I will show how this work relates to your projects.
[1] Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3, doi:10.1038/sdata.2016.18


  1. 1. Being FAIR: FAIR data and model management Professor Carole Goble, carole.goble@manchester.ac.uk The University of Manchester, UK The FAIRDOM Association Coordinator ELIXIR-UK Head of Node Co-lead ELIXIR Interoperability Platform SSBSS 2017, July 17 2017, Cambridge, UK 4th International Synthetic & Systems Biology Summer School
  2. 2. Data-driven and predictive biology Data, Software, Models, SOPs….MATTER Not a by-product. It’s the fuel. The assets. modellers experimentalists
  3. 3. Why Data Management http://fair-dom.org https://www.youtube.com/watch?v=N2zK3sAtr-4 https://www.youtube.com/watch?v=PWutnWBfUSw
  4. 4. Systems Approach: Context + more than Data models, data, SOPs, samples, strains, publications…. multiple, interrelated assets. multiple, dispersed repositories Multiple omics: genomics, transcriptomics proteomics, metabolomics, fluxomics, reactomics Images, molecular biology, reaction kinetics… SOPs, sample and strain metadata… Models: Metabolic, gene network, kinetic… Scripts and workflows The relationships between… Tracking: versions, provenance, parameters… Citation and credit… Standards fairsharing.org
  5. 5. More than simple supplementary materials 16 datafiles (kinetic, flux inhibition, runout) 19 models (kinetics, validation) 13 SOPs 3 studies (model analysis, construction, validation) 24 assays/analyses (simulations, model characterisations) Penkler, G., du Toit, F., Adams, W., Rautenbach, M., Palm, D. C., van Niekerk, D. D. and Snoep, J. L. (2015), Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum. FEBS J, 282: 1481–1511. doi:10.1111/febs.13237
  6. 6. Synthetic Approach - Automation • Automate data management – Spreadsheets, instruments, LIMS… – Replication, comparison • Support automation – Tracking successful products from plasmids – Informing robots – Incorporate into pipelines and workflows – Mediate through samples – Standards [Courtesy: Andrew Millar]
  7. 7. Systems Approach…Collaboration teams, disciplines, partners What methods are being used to determine enzyme activity? What SOP was used for this sample? Where is the validation data for this model? Is there any group generating kinetic data? Is this data available? Track versions of my model What's the relationship between the data and model? Which data belong to which publications?
  8. 8. modellers experimentalists
  9. 9. End to end Management Project Boot up, Run and Wash up • Capture • Track • Organise & Link • Curate • Report • Exchange • Retain • Integrate • Reuse other systems • Support data-driven processes CREATING DATA PROCESSING DATA ANALYSING DATA PRESERVING DATA ACCESS TO DATA RE-USING DATA
  10. 10. The FAIR Guiding Principles for scientific data management and stewardship https://www.nature.com/articles/sdata201618 (2016)
  11. 11. The greater good…. Access to public funded research, Reproducible results Value and cite all research outputs https://www.nature.com/articles/sdata201618 (2016)
  12. 12. UK Funder Data Policies http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies Compliance and Policy Data Management Plans
  13. 13. https://wellcomeopenresearch.org/ Nature Scientific Data Data (and software) as a first class citizen Data (and software) Citation Scholarly Communications Providers
  14. 14. The Personal good…. • reviewers want additional work • statistician wants more runs • analysis needs to be repeated • post-doc leaves, • student arrives • new/revised datasets • updated/new versions of algorithms/codes • sample was contaminated • better kit - longer simulations • new partners, new projects Personal & Lab Productivity Sharing Reproducibility
  15. 15. Catalogues Standards: identifiers, metadata Stores Policy, Identifiers, Authorised Access & Licensing
  16. 16. Standards are not always used.... Formats Metadata Metadata reporting guidelines Ontologies *top three most popular The evolution of standards and data management practices in systems biology (2015). Stanford et al, Molecular Systems Biology, 11(12):851
  17. 17. … model reuse and reproducibility tricky… Stanford et al, The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
  18. 18. Catalogues, Storage and Publishing Active Data, Published Data local/project LIMS, data management, analytics. active data. global, public, central subject-specific databases published data. ACT LOCAL THINK GLOBAL
  19. 19. Cloud services figshare zenodo Amazon Web Services Google Cloud Azure, EBI Embassy Cloud Own cloud FAIRDOMHub mendeley data Cloud Data Services Cloud hosting services OpenAIRE
  20. 20. Catalogues, Storage and Publishing Active Data, Published Data Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
  21. 21. Catalogues, Storage and Publishing Active Data, Published Data Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053 Type specific archives Fragmented silos
  22. 22. Catalogues, Storage and Publishing Active Data, Published Data Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053 Type specific archives Fragmented silos Experimental context All together
  23. 23. Catalogues, Storage and Publishing Active Data, Published Data Central Repository Centric Research Infrastructure for FAIR Data for Life Sciences in Europe Top down, 21 National Nodes + EMBL-EBI Project Centric FAIR Research Data Management for Data, SOPs, Models for Systems and Synthetic Biology Projects Grass roots Association of Institutions and members funded by 4 EU countries. http://www.fair-dom.org http://www.elixir-europe.org
  24. 24. modellers experimentalists
  25. 25. FAIRDOM Consortium Since 2008…. ERANets and ERACoFunds National Programmes National Centres EU Research Infrastructures Sponsors:
  26. 26. Built by Project PALs post docs, postgrads and techs
  27. 27. FAIRDOM FAIRDOM Software Platform+Tools A Central Public Hub for Projects Customised Project Installations Project Stewardship Consultancy Services Community Activities 80 Projects 30+ Installations
  28. 28. http://fair-dom.org/knowledgehub/data-management-checklist/ https://dmponline.dcc.ac.uk/ http://dmp.fairdata.solutions/ (very early alpha)
  29. 29. FAIR Checklists Making Data Findable (documentation and metadata management) • What documentation and metadata will accompany the data (assist its discoverability)? (Details on methodology, definitions, procedures, SOPs, vocabularies, units, dependencies, etc) • What information is needed for the data to be read and interpreted in the future? • What naming conventions will be used? • How will you approach versioning your data? • How will you capture / create this documentation and metadata? • How do you ensure the completeness of the captured data? Making Data Accessible Specify which data will be made openly available taking into consideration • What ethics and legal compliance issues do you have if any? Do you need consent for data preservation and sharing? Do you have to protect certain data? Is any data sensitive? • Do you think you might have Intellectual Property Rights issues? Have you considered ownership of the data, licensing, restrictions on use? • Do you think you will need to embargo any data? • How will you make the data available? (consider the platforms you will use: databases, repositories, etc) • What methods or software tools are needed to access the data? Should you include documentation detailing how to access and use the software that is needed for accessing the data? Is it possible to include this software with the data (e.g. source code, docker etc)? • If there are any restrictions on accessibility, how will you provide access? Making Data Interoperable • What standards (metadata vocabularies, formats, checklists) or methodologies will you use? • How do you address data and model quality? What validation steps do you foresee? • Will you use standardised vocabulary for all data types to allow inter-disciplinary interoperability? • Where you cannot use standardised vocabulary for all types of data, can you map to more commonly used ontologies? Making Data Re-usable • How will you licence your data to permit the widest re-use possible? • When will the data be made available for re-use? Does this include an embargo period? (if so, why?) • Which data will be available for re-use during/after the project? If not, why? • What are your data quality assurance processes? • How long do you expect your data to remain re-usable?
  30. 30. Community Actions http://www.fair-dom.org Samples Club Developers Club
  31. 31. Stewardship Support 500K needed*, a new career needing a career path *European Open Science Cloud Report
  32. 32. FAIRDOM Platform Free and Open Source Front end Project(s) Hub Back end Onsite storage & analytics On site Tracking, data analytic pipelines, Extract, Transform and Load direct from the instruments, large data management LIMS, auto-archiving Web-based portal Project controlled spaces Metadata catalogue & Yellow pages Results repository, dissemination and collaboration Tool gateway
  33. 33. Back end Instrument Data Management, LIMS, ELN Samples Protocols Experiment Description Raw Data Analysis Scripts Results Laboratory Notebook & Inventory Manager ELN LIMS-like linking data to biological materials • samples+protocols management • data management • experimental description Big Data analytics on distributed compute resources
  34. 34. • Project controlled protected spaces – Working space, show space for results – Supp. materials space for publications – Yellow pages and collaboration – Upload or link to data • One place catalogue – Regardless of physical store – Organised in ISA with shared metadata – Standards-compliant • Linked with other systems – Project on-site (secure) repositories – Public deposition archives – Integrated with JWSOnline modelling tools Front End Hub: A Commons one place to Find, Access and organise assets “Using FAIRDOMHub my own lab colleagues saw what I was doing and called to collaborate!”
  35. 35. 859 people 80 projects 198 institutions FAIRDOMHub.org Public Commons self managed workspaces, controlled sharing, shared metadata yellow pages
  36. 36. More than simple supplementary materials 16 datafiles (kinetic, flux inhibition, runout) 19 models (kinetics, validation) 13 SOPs 3 studies (model analysis, construction, validation) 24 assays/analyses (simulations, model characterisations) Penkler, G., du Toit, F., Adams, W., Rautenbach, M., Palm, D. C., van Niekerk, D. D. and Snoep, J. L. (2015), Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum. FEBS J, 282: 1481–1511. doi:10.1111/febs.13237
  37. 37. Investigation Study Analysis Data Model SOP(Assay) https://fairdomhub.org/investigations/56
  38. 38. Catalogue across repositories regardless of location federated stores retaining context to support decision making and reuse bridging local and global In House Stores External Databases Publishing services Secure Stores Model Resources Upload or Reference
  39. 39. Protected spaces, sharing sensitivities Open science applies to you but not me… not available, not citable. Licenses Negotiated access Embargos Permission controls Staged sharing
  40. 40. Act Local Think Global Cloud Service .org Local retention In flight management, Private sharing Customisation Centres, large projects National projects Local skills for admin support Post-project retention One stop showcase Self-managed sharing Supplementary materials Off-the-shelf features Hosted on behalf of users Delegated admin support Long term repository • Trusted repository • Guaranteed until 2029 • Long term maintenance • Sustainability • 1TB per project stored centrally. • Much more catalogued.
  41. 41. Hub common space, one place to organise and report your assets .org Nucl. Acids Res. (2016) doi: 10.1093/nar/gkw1032 70+ Projects 30+ Installations Public & cloud Subject and Datatype archives
  42. 42. Typical Data Flows HTP data processing management exchange deposition publishing reporting ORGANISATION COMMUNICATION samples analytics models, SOPs processed data DISSEMINATION Less data, more metadata, potentially wider access processed data
  43. 43. Publishing…snapshot and assign DOIs Credits and Citations G. Penkler, F. Du Toit, W. Adams, M. Rautenbach, D. C. Palm, D. D. Van Niekerk, & J. L. Snoep. (2014). Glucose metabolism in Plasmodium falciparum trophozoites. FAIRDOMHub. http://doi.org/10.15490/seek.1.investigation.56 Snapshot to fix state with particular versions Assign a DOI Entry has citation metadata Use in journals and in metrics systems Active entry continues to evolve Fenner et al, A Data Citation Roadmap for Scholarly Data Repositories doi: https://doi.org/10.1101/097196
  44. 44. An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”. Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”. Ottoline Leyser […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”. http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
  45. 45. Retention: Moses from the ERANET SysMO Programme Project ended in 2010 Publication in 2014/2015 Using data from 2012 [Maxim Zakhartsev]
  46. 46. [Adapted from Ursula Klingmüller, Martin Böhm] Excemplify Antibody Database FAIR collaboration from the ERANet ERASysAPP
  47. 47. Programme Overarching research theme (The Digital Salmon) Project Research grant (DigiSal, GenoSysFat) Investigation A particular biological process, phenomenon or thing (typically corresponds to [plans for] one or more closely related papers) Study Experiment whose design reflects a specific biological research question Assay Standardized measurement or diagnostic experiment using a specific protocol (applied to material from a study) Jon Olav Vik, Norwegian University of Life Science Integration with Norway’s national e-infrastructure for Life Science (NeLS)
  48. 48. Specialist databases Local Biochem4j ICE Global Brenda, wikipathways, Biomodels ICE Public Deposition Databases Public Catalogues Tracking in Specialist Systems Institutional Catalogue & Repository
  49. 49. Specialist databases Local Biochem4j ICE Global Brenda, wikipathways, Biomodels ICE Public Deposition Databases Public Catalogues Institutional Catalogue & Repository Tracking in Specialist Systems
  50. 50. Ubiquitous Spreadsheet • Unifying processes • Common spreadsheet models – Consistency and quality of collaboration – Common identifier meanings – Metadata collection Tracking in Specialist Systems
  51. 51. http://www.fairdomhub.org https://sandbox1.fairdomhub.org • empty box for safe playing • copy the investigation that is there • add your name to the guest list so we don’t double up - http://tinyurl.com/sandboxlist Try out for yourself…
  52. 52. The first steps? • Metadata design • Samples – The link between everything • The ubiquitous spreadsheet – Templates and exchange… – Unifying processes – Carrying best practice Image from FAIRSharing.org
  53. 53. Use and reuse standard identifiers General standards Site specific Community standards e.g. SynBioChem ICE Strain convention A URL preferably to identifiers.org that resolves to the description of the host strain in NCBI taxonomy e.g. E. coli DH5α http://identifiers.org/taxonomy/668369 location independent resolvable identifiers (URIs) decoupling the identification of records from their physical locations
  54. 54. Investigation: Glucose metabolism in P. falciparum trophozoites Study: Model construction Study: Model validation Assay: LDH Assay: PK Assay: ENO Assay: PGM Assay: PGK Assay: GAPDH Assay: TPI Assay: ALD Assay: PFK Assay: PGI Assay: HK Assay: GLCtr Assay: PYRtr Assay: LACtr Assay: G3PDH Assay: GLYtr Assay: ATPase Data: GLCtr Model: GLCtr Data: HK Model: HK Steady state Incubation penkler1 Validation data penkler2 Validation data ... ... SOP: GLCtr SOP: HK ... SOP: Validation Assay: Culturing Assay: Lysate prep. SOP: Culturing SOP: Lysate prep. Design an ISA (Investigation, Study, Assay/Analysis) structure. Devising this makes you think…..
  55. 55. Use FAIRData and Metadata Standards help to improve understanding and exchange…. Credit: Nicolas Le Novère, Babraham Institute, UK, adapted. represents genetic designs - standardized vocabulary of schematic glyphs - standardized digital format. ICE, SBOLStack, iGEM CIMR Core Information for Metabolomics Reporting MIABE Minimal Information About a Bioactive Entity MIACA Minimal Information About a Cellular Assay MIAME Minimum Information About a Microarray Experiment MIAME/Nutr MIAME / Nutrigenomics MIAME/Plant MIAME / Plant transcriptomics MIAME/Tox MIAME /Toxicogenomics MIAPA Minimum Information About a Phylogenetic Analysis MIAPAR Minimum Information About a Protein Affinity Reagent MIAPE Minimum Information About a Proteomics Experiment MIARE Minimum Information About a RNAi Experiment MIASE Minimum Information About a Simulation Experiment
  56. 56. Where do I go for standards information? Linking models…. • connecting (experimental/simulation) data to models • connecting the single standards? • interfacing between the different scales? https://fairsharing.org/collection/FAIRDOM
  57. 57. How do I design the metadata? Metadata ramps Metadata Registration and Use
  58. 58. Metadata ramps: spreadsheet templates Tooling for annotations and checklist templates for different types of assay data. Embed ontologies into Excel templates Excel spreadsheets enriched with ontology annotations Upload, extract metadata and register http://www.rightfield.org.uk
  59. 59. Ramping up Samples Spreadsheets! A new framework for Syn and Sys Bio Samples are Inputs and Outputs…. compliant
  60. 60. Sounds hard…. what can I do? 12 steps to being FAIR plan to be born FAIR 1. plan data management lifecycle: plan, cost and implement pathways and storage including what you will archive, what you will throw away, how you will collect metadata and how you will curate throughout 2. use standard identifiers and identifier standards 3. use metadata standards with data provenance 4. catalogue / register data with metadata 5. have access and sharing policies with licenses 6. use data (assets) management platforms and tools that work together 7. deposit into public archives 8. have a sustainability / end project plan 9. resource and support, and that also means people too 10. embed data management into work practices and do some training 11. give credit 12. check if you have sensitive data issues
  61. 61. What can you do? • Make a Data Management Plan (check the checklist). • Get an account on the FAIRDOMHub or install your own. • Define and share your SOPs. • Who is your group’s data steward? • How are they getting credit? • Know your local data management policies and resources. • Get some training. • Educate your supervisors, institutions and peers. • Build some metadata ramps
  62. 62. The Data Steward function, profession, cultural shift • 500,000 needed in Europe* • Specialist skills • Career pathways • Recognition Curation and management • Supported, Resourced • Recognised, Rewarded Sharing policy and practice embedded * Realising the European Open Science Cloud (2016)
  63. 63. Jon Olav Vik, Norwegian University of Life Science Maksim Zakhartsev University Hohenheim, Stuttgart, Germany Alexey Kolodkin Siberian Branch Russian Academy of Sciences Tomasz Zieliński, SynthSys Centre University Edinburgh, UK Martin Peters, Martin Scharm Systems Biology Bioinformatics University of Rostock, Germany
  64. 64. Reading List • Wolstencroft et al (2016). “FAIRDOMHub: a repository and collaboration environment for sharing systems biology research”. Nucleic Acids Research, 45(D1): D404-D407. DOI: 10.1093/nar/gkw1032 • Rice and Southall, The Data Librarian's Handbook, Wiley Publishing, 2016 • Stanford et al, The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053 • Wilkinson et al, The FAIR Guiding Principles for scientific data management and stewardship, https://www.nature.com/articles/sdata201618 (2016) • McMurry, Juty, et al. (2017) Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol 15(6): e2001414. https://doi.org/10.1371/journal.pbio.2001414 • Fenner et al, A Data Citation Roadmap for Scholarly Data Repositories doi: https://doi.org/10.1101/097196 • Realising the European Open Science Cloud https://ec.europa.eu/research/openscience/pdf/realising_the_european_open_science_cloud_2016.pdf
  65. 65. Website list • FAIRDOM http://www.fair-dom.org • FAIRDOMHub http://www.fairdomhub.org • Rightfield http://www.rightfield.org.uk • FAIRSharing http://www.fairsharing.org • ELIXIR http://www.elixir-europe.org • Software Carpentry https://software-carpentry.org/ • DataCarpentry http://www.datacarpentry.org/ • Sandbox https://sandbox1.fairdomhub.org • empty box for safe playing • copy the investigation that is there • add your name to the guest list so we don’t double up - http://tinyurl.com/sandboxlist
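
Slides 53 and 54 ask for two concrete things: resolvable identifiers and an ISA (Investigation, Study, Assay) structure. The following is a minimal Python sketch of both, not FAIRDOM's own data model or API. The class names, the `taxonomy_uri` helper and the `outline` method are hypothetical, and the grouping of assets under assays is an illustrative reading of the flattened slide 54. Only the identifiers.org URL pattern and the titles come from the slides.

```python
from dataclasses import dataclass, field
from typing import List

# Resolver named on slide 53; the helper itself is a hypothetical convenience.
IDENTIFIERS_ORG = "http://identifiers.org"


def taxonomy_uri(taxon_id: int) -> str:
    """Build a location-independent URI for an NCBI taxonomy record."""
    return f"{IDENTIFIERS_ORG}/taxonomy/{taxon_id}"


@dataclass
class Assay:
    title: str
    assets: List[str] = field(default_factory=list)  # data, models, SOPs


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)

    def outline(self) -> List[str]:
        """Return the ISA tree as indented text lines."""
        lines = [f"Investigation: {self.title}"]
        for study in self.studies:
            lines.append(f"  Study: {study.title}")
            for assay in study.assays:
                lines.append(f"    Assay: {assay.title}")
                for asset in assay.assets:
                    lines.append(f"      {asset}")
        return lines


# Titles taken from slide 54; which assets sit under which assay is assumed.
investigation = Investigation(
    title="Glucose metabolism in P. falciparum trophozoites",
    studies=[
        Study("Model construction", [
            Assay("HK", ["Data: HK", "Model: HK", "SOP: HK"]),
            Assay("GLCtr", ["Data: GLCtr", "Model: GLCtr", "SOP: GLCtr"]),
        ]),
        Study("Model validation", [
            Assay("Steady state", ["Validation data", "SOP: Validation"]),
        ]),
    ],
)

if __name__ == "__main__":
    # Host strain example from slide 53 (E. coli DH5α).
    print(taxonomy_uri(668369))
    print("\n".join(investigation.outline()))
```

Writing the tree down before any data exists is the point made on slide 54: deciding which assay each data file, model and SOP belongs to forces the questions about context that are otherwise left until publication.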