Being FAIR:
FAIR data and model
management
Professor Carole Goble, carole.goble@manchester.ac.uk
The University of Manchester, UK
The FAIRDOM Association Coordinator
ELIXIR-UK Head of Node
Co-lead ELIXIR Interoperability Platform
SSBSS 2017, July 17 2017, Cambridge, UK
4th International Synthetic & Systems Biology Summer School
Data-driven and predictive biology
Data, Software, Models, SOPs….MATTER
Not a by-product.
It’s the fuel.
The assets.
modellers
experimentalists
Why Data Management
http://fair-dom.org
https://www.youtube.com/watch?v=N2zK3sAtr-4
https://www.youtube.com/watch?v=PWutnWBfUSw
Systems Approach: Context + more than Data
models, data, SOPs, samples, strains, publications….
multiple, interrelated assets. multiple, dispersed repositories
Multiple omics: genomics, transcriptomics
proteomics, metabolomics, fluxomics,
reactomics
Images, molecular biology, reaction kinetics…
SOPs, sample and strain metadata…
Models: Metabolic, gene network, kinetic…
Scripts and workflows
The relationships between…
Tracking: versions, provenance, parameters…
Citation and credit…
Standards
fairsharing.org
More than simple supplementary materials
16 datafiles (kinetic, flux inhibition, runout)
19 models (kinetics, validation)
13 SOPs
3 studies (model analysis, construction,
validation)
24 assays/analyses (simulations, model
characterisations)
Penkler, G., du Toit, F., Adams, W., Rautenbach, M.,
Palm, D. C., van Niekerk, D. D. and Snoep, J. L. (2015),
Construction and validation of a detailed kinetic model
of glycolysis in Plasmodium falciparum. FEBS J, 282:
1481–1511. doi:10.1111/febs.13237
Synthetic Approach - Automation
• Automate data management
– Spreadsheets, instruments, LIMS…
– Replication, comparison
• Support automation
– Tracking successful products from plasmids
– Informing robots
– Incorporate into pipelines and workflows
– Mediate through samples
– Standards
[Courtesy: Andrew Millar]
Systems Approach…Collaboration
teams, disciplines, partners
What methods are being used to determine
enzyme activity?
What SOP was used for this
sample?
Where is the validation data for this model?
Is there any group generating kinetic data?
Is this data available?
Track versions of my model
What's the relationship between the data and
model?
Which data belong to
which publications?
modellers
experimentalists
End to end Management
Project Boot up, Run and Wash up
• Capture
• Track
• Organise & Link
• Curate
• Report
• Exchange
• Retain
• Integrate
• Reuse other systems
• Support data-driven processes
CREATING DATA, PROCESSING DATA, ANALYSING DATA,
PRESERVING DATA, ACCESS TO DATA, RE-USING DATA
The FAIR Guiding Principles for scientific data management and stewardship
https://www.nature.com/articles/sdata201618 (2016)
The greater good….
Access to publicly funded research, Reproducible results
Value and cite all research outputs
https://www.nature.com/articles/sdata201618 (2016)
UK Funder Data Policies
http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies
Compliance and Policy
Data Management Plans
https://wellcomeopenresearch.org/ Nature Scientific Data
Data (and software) as a first class citizen
Data (and software) Citation
Scholarly Communications Providers
The Personal good….
• reviewers want additional work
• statistician wants more runs
• analysis needs to be repeated
• post-doc leaves,
• student arrives
• new/revised datasets
• updated/new versions of
algorithms/codes
• sample was contaminated
• better kit - longer simulations
• new partners, new projects
Personal & Lab
Productivity
Sharing
Reproducibility
Catalogues
Standards: identifiers, metadata
Stores
Policy, Identifiers, Authorised Access & Licensing
Standards are not always used....
Formats
Metadata
Metadata reporting guidelines
Ontologies
*top three most popular
The evolution of standards and data management practices in systems biology (2015). Stanford et al,
Molecular Systems Biology, 11(12):851
… model reuse and reproducibility tricky…
Stanford et al, The evolution of standards and data management practices in systems biology,
Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
Catalogues, Storage and Publishing
Active Data, Published Data
local/project LIMS, data management,
analytics. active data.
global, public, central subject-specific databases
published data.
ACT LOCAL, THINK GLOBAL
Cloud services
figshare
zenodo
Amazon Web Services
Google Cloud
Azure, EBI Embassy Cloud
Own cloud
FAIRDOMHub
mendeley data
Cloud Data Services
Cloud hosting services
OpenAIRE
Catalogues, Storage and Publishing
Active Data, Published Data
Stanford et al The evolution of standards and data
management practices in systems biology, Molecular
Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
Type specific archives
Fragmented silos
Experimental context
All together
Catalogues, Storage and Publishing
Active Data, Published Data
Central Repository Centric
Research Infrastructure for FAIR
Data for Life Sciences in Europe
Top down, 21 National Nodes +
EMBL-EBI
Project Centric
FAIR Research Data Management for
Data, SOPs, Models for Systems and
Synthetic Biology Projects
Grass roots Association of Institutions
and members funded by 4 EU countries.
http://www.fair-dom.org http://www.elixir-europe.org
modellers
experimentalists
FAIRDOM Consortium
Since 2008….
ERANets and ERACoFunds
National Programmes
National Centres
EU Research Infrastructures Sponsors:
Built by Project PALs
post docs, postgrads and techs
FAIRDOM
FAIRDOM Software Platform+Tools
A Central Public Hub
for Projects
Customised Project
Installations
Project Stewardship
Consultancy Services
Community
Activities
80 Projects 30+ Installations
http://fair-dom.org/knowledgehub/data-management-checklist/
https://dmponline.dcc.ac.uk/
http://dmp.fairdata.solutions/ (very early alpha)
FAIR Checklists
Making Data Findable (documentation and metadata management)
• What documentation and metadata will accompany the data (assist its
discoverability)? (Details on methodology, definitions, procedures, SOPs,
vocabularies, units, dependencies, etc)
• What information is needed for the data to be read and interpreted in the
future?
• What naming conventions will be used?
• How will you approach versioning your data?
• How will you capture / create this documentation and metadata?
• How do you ensure the completeness of the captured data?
Making Data Accessible
Specify which data will be made openly available taking into consideration
• What ethics and legal compliance issues do you have if any? Do you need
consent for data preservation and sharing? Do you have to protect certain
data? Is any data sensitive?
• Do you think you might have Intellectual Property Rights issues? Have you
considered ownership of the data, licensing, restrictions on use?
• Do you think you will need to embargo any data?
• How will you make the data available? (consider the platforms you will use:
databases, repositories, etc)
• What methods or software tools are needed to access the data? Should you
include documentation detailing how to obtain and use the software that is
needed for accessing the data? Is it possible to include this software with the
data (e.g. source code, Docker, etc.)?
• If there are any restrictions on accessibility, how will you provide access?
Making Data Interoperable
• What standards (metadata vocabularies, formats,
checklists) or methodologies will you use?
• How do you address data and model quality? What
validation steps do you foresee?
• Will you use standardised vocabulary for all data types
to allow inter-disciplinary interoperability?
• Where you cannot use a standardised vocabulary for all
types of data, can you map to more commonly used
ontologies?
Making Data Re-usable
• How will you licence your data to permit the widest re-use possible?
• When will the data be made available for re-use? Does
this include an embargo period? (if so, why?)
• Which data will be available for re-use during/after the
project? If not, why?
• What are your data quality assurance processes?
• How long do you expect your data to remain re-usable?
Community Actions
http://www.fair-dom.org
Samples Club Developers Club
Stewardship Support
500K needed*, a new career needing a career path
*European Open Science Cloud Report
FAIRDOM Platform
Free and Open Source
Front end
Project(s) Hub
Back end
Onsite storage & analytics
On site
Tracking, data analytic pipelines,
Extract, Transform and Load direct from the
instruments, large data management
LIMS, auto-archiving
Web-based portal
Project controlled spaces
Metadata catalogue & Yellow pages
Results repository, dissemination and collaboration
Tool gateway
Built using [logos not reproduced]
Back end
Instrument Data Management, LIMS, ELN
Samples
Protocols
Experiment
Description
Raw Data
Analysis
Scripts
Results
Laboratory Notebook &
Inventory Manager
ELN
LIMS-like
linking data to biological materials
• samples+protocols management
• data management
• experimental description
Big Data analytics on distributed compute resources
• Project controlled protected spaces
– Working space, show space for results
– Supp. materials space for publications
– Yellow pages and collaboration
– Upload or link to data
• One place catalogue
– Regardless of physical store
– Organised in ISA with shared metadata
– Standards-compliant
• Linked with other systems
– Project on-site (secure) repositories
– Public deposition archives
– Integrated with JWSOnline modelling tools
Front End Hub: A Commons
one place to Find, Access and organise assets
“Using FAIRDOMHub my own
lab colleagues saw what I was
doing and called to
collaborate!”
859 people
80 projects
198 institutions
FAIRDOMHub.org Public Commons
self managed workspaces, controlled sharing, shared metadata
yellow pages
More than simple supplementary materials
16 datafiles (kinetic, flux inhibition, runout)
19 models (kinetics, validation)
13 SOPs
3 studies (model analysis, construction,
validation)
24 assays/analyses (simulations, model
characterisations)
Penkler, G., du Toit, F., Adams, W., Rautenbach, M.,
Palm, D. C., van Niekerk, D. D. and Snoep, J. L. (2015),
Construction and validation of a detailed kinetic model
of glycolysis in Plasmodium falciparum. FEBS J, 282:
1481–1511. doi:10.1111/febs.13237
Investigation
Study Analysis
Data
Model
SOP(Assay)
https://fairdomhub.org/investigations/56
Catalogue across repositories regardless of location
federated stores retaining context to support decision making and reuse
bridging local and global
In House Stores
External Databases
Publishing services
Secure Stores
Model Resources
Upload or
Reference
Protected spaces, sharing sensitivities
Open science applies to you but not me…
not available, not citable.
Licenses
Negotiated access
Embargos
Permission controls
Staged sharing
Act Local, Think Global Cloud Service
.org
Local retention
In flight management,
Private sharing
Customisation
Centres, large projects
National projects
Local skills for admin support
Post-project retention
One stop showcase
Self-managed sharing
Supplementary materials
Off-the-shelf features
Hosted on behalf of users
Delegated admin support
Long term repository
• Trusted repository
• Guaranteed until 2029
• Long term maintenance
• Sustainability
• 1TB per project stored centrally.
• Much more catalogued.
Hub common space, one place
to organise and report your assets
.org
Nucl. Acids Res. (2016) doi: 10.1093/nar/gkw1032
70+ Projects
30+ Installations
Public & cloud
Subject and Datatype archives
Typical Data Flows
HTP data
processing
management
exchange
deposition
publishing
reporting
ORGANISATION
COMMUNICATION
samples
analytics
models, SOPs
processed
data
DISSEMINATION
Less data, more metadata, potentially wider access
processed
data
Publishing…snapshot and assign DOIs
Credits and Citations
G. Penkler, F. Du Toit, W. Adams, M.
Rautenbach, D. C. Palm, D. D. Van
Niekerk, & J. L. Snoep. (2014).
Glucose metabolism in Plasmodium
falciparum trophozoites. FAIRDOMHub.
http://doi.org/10.15490/seek.1.investigation.56
Snapshot to fix state with
particular versions
Assign a DOI
Entry has citation metadata
Use in journals and in metrics
systems
Active entry continues to
evolve
Fenner et al, A Data Citation Roadmap for Scholarly Data Repositories
doi: https://doi.org/10.1101/097196
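The snapshot idea on this slide can be sketched in a few lines of Python. This is an illustration only, not how FAIRDOMHub or any DOI registration service implements it: the `Snapshot` class, its field names and the asset version numbers are hypothetical, and the DOI is copied from the citation above rather than minted by the code.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Snapshot:
    """A fixed state of an entry: citation metadata plus pinned asset versions."""
    title: str
    authors: Tuple[str, ...]
    year: int
    publisher: str
    doi: str
    asset_versions: Dict[str, int]  # asset name -> version frozen in this snapshot

    def fingerprint(self) -> str:
        # Hash of the title and pinned versions, so two snapshots that pin
        # different versions can be told apart.
        payload = json.dumps(
            {"title": self.title, "assets": self.asset_versions}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def citation(self) -> str:
        names = ", ".join(self.authors)
        return (f"{names} ({self.year}). {self.title}. "
                f"{self.publisher}. http://doi.org/{self.doi}")


snapshot = Snapshot(
    title="Glucose metabolism in Plasmodium falciparum trophozoites",
    authors=("G. Penkler", "F. Du Toit", "W. Adams", "M. Rautenbach",
             "D. C. Palm", "D. D. Van Niekerk", "J. L. Snoep"),
    year=2014,
    publisher="FAIRDOMHub",
    doi="10.15490/seek.1.investigation.56",
    # Hypothetical version numbers, for illustration only.
    asset_versions={"Model: GLCtr": 2, "Data: GLCtr": 1},
)

# The active entry keeps evolving; a later state pins a newer model version.
later = replace(snapshot, asset_versions={"Model: GLCtr": 3, "Data: GLCtr": 1})

print(snapshot.citation())
print(snapshot.fingerprint() != later.fingerprint())
```

The frozen dataclass mirrors the point of the slide: the cited snapshot stays fixed, while any later state of the active entry is a separate object.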
An “evolving manuscript” would begin with a pre-
publication, pre-peer review “beta 0.9” version of an
article, followed by the approved published article itself, [
… ] “version 1.0”.
Subsequently, scientists would update this paper with
details of further work as the area of research develops.
Versions 2.0 and 3.0 might allow for the “accretion of
confirmation [and] reputation”.
Ottoline Leyser […] assessment criteria in science revolve
around the individual. “People have stopped thinking
about the scientific enterprise”.
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Retention: Moses
from the ERANET SysMO Programme
Project ended in 2010
Publication in 2014/2015
Using data from 2012
[Maxim Zakhartsev]
[Adapted from Ursula Klingmüller, Martin Böhm]
Excemplify
Antibody
Database
FAIR collaboration
from the ERANet ERASysAPP
Programme
Overarching research theme (The Digital Salmon)
Project
Research grant (DigiSal, GenoSysFat)
Investigation
A particular biological process, phenomenon or thing
(typically corresponds to [plans for] one or more closely related
papers)
Study
Experiment whose design reflects a specific biological research
question
Assay
Standardized measurement or diagnostic experiment using a
specific protocol
(applied to material from a study)
Jon Olav Vik,
Norwegian University of Life Sciences
Integration with Norway’s national
e-infrastructure for Life Sciences (NeLS)
Specialist databases
Local: Biochem4j, ICE
Global: Brenda, wikipathways, Biomodels, ICE
Public Deposition Databases
Public Catalogues
Tracking in Specialist Systems
Institutional Catalogue & Repository
Ubiquitous Spreadsheet
• Unifying processes
• Common spreadsheet models
– Consistency and quality of
collaboration
– Common identifier meanings
– Metadata collection
Tracking in
Specialist Systems
http://www.fairdomhub.org
https://sandbox1.fairdomhub.org
• empty box for safe playing
• copy the investigation that is there
• add your name to the guest list so we don’t double
up - http://tinyurl.com/sandboxlist
Try out for yourself…
The first steps?
• Metadata design
• Samples
– The link between everything
• The ubiquitous spreadsheet
– Templates and exchange…
– Unifying processes
– Carrying best practice
Image from FAIRSharing.org
Use and reuse
standard identifiers
General standards
Site specific
Community standards
e.g. SynBioChem ICE Strain convention
A URL preferably to identifiers.org that resolves to the
description of the host strain in NCBI taxonomy
e.g. E. coli DH5α http://identifiers.org/taxonomy/668369
location independent resolvable
identifiers (URIs) decoupling the
identification of records from
their physical locations
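A minimal sketch of the convention above, assuming the URL pattern shown on the slide (`http://identifiers.org/<namespace>/<accession>`). The `Identifier` class and `parse_uri` helper are hypothetical names used for illustration; the code only builds and parses the string and does not contact the resolver.

```python
from dataclasses import dataclass

# Assumption: the resolver prefix as written on the slide.
RESOLVER = "http://identifiers.org"


@dataclass(frozen=True)
class Identifier:
    namespace: str  # e.g. "taxonomy"
    accession: str  # e.g. "668369"

    def uri(self) -> str:
        """Location-independent URI for this record."""
        return f"{RESOLVER}/{self.namespace}/{self.accession}"


def parse_uri(uri: str) -> Identifier:
    """Split a resolver URI back into namespace and accession."""
    prefix = RESOLVER + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a {RESOLVER} URI: {uri}")
    namespace, _, accession = uri[len(prefix):].partition("/")
    if not namespace or not accession:
        raise ValueError(f"missing namespace or accession: {uri}")
    return Identifier(namespace, accession)


# Host strain from the slide: E. coli DH5α in NCBI taxonomy.
host_strain = Identifier("taxonomy", "668369")
print(host_strain.uri())
```

Storing the namespace and accession rather than a database-specific URL is what keeps the record's identity separate from where it is hosted.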
Investigation:
Glucose metabolism in P.
falciparum trophozoites
Study:
Model construction
Study:
Model validation
Assay: LDH
Assay: PK
Assay: ENO
Assay: PGM
Assay: PGK
Assay: GAPDH
Assay: TPI
Assay: ALD
Assay: PFK
Assay: PGI
Assay: HK
Assay: GLCtr
Assay: PYRtr
Assay: LACtr
Assay: G3PDH
Assay: GLYtr
Assay: ATPase
Data: GLCtr
Model: GLCtr
Data: HK
Model: HK
Steady state
Incubation
penkler1
Validation data
penkler2
Validation data
...
...
SOP: GLCtr
SOP: HK
...
SOP: Validation
Assay: Culturing
Assay: Lysate prep.
SOP: Culturing
SOP: Lysate prep.
Design an ISA (Investigation, Study,
Assay/Analysis) structure.
Devising this
makes you
think…..
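One way to start devising an ISA structure is to write it down as plain data before touching any platform. The sketch below is an assumption-laden illustration, not the FAIRDOMHub data model or the ISA-Tab format: the class and field names are hypothetical, only a few of the assays from the slide are included, and the SOP is deliberately left off one assay to show the check.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    title: str
    sops: List[str] = field(default_factory=list)
    data: List[str] = field(default_factory=list)
    models: List[str] = field(default_factory=list)


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)

    def assays_without_sop(self) -> List[str]:
        """Titles of assays that hold data or models but link to no SOP."""
        return [
            assay.title
            for study in self.studies
            for assay in study.assays
            if (assay.data or assay.models) and not assay.sops
        ]


investigation = Investigation(
    "Glucose metabolism in P. falciparum trophozoites",
    studies=[
        Study("Model construction", assays=[
            Assay("GLCtr", sops=["SOP: GLCtr"],
                  data=["Data: GLCtr"], models=["Model: GLCtr"]),
            Assay("HK", sops=["SOP: HK"],
                  data=["Data: HK"], models=["Model: HK"]),
        ]),
        Study("Model validation", assays=[
            # SOP omitted on purpose to illustrate the check below.
            Assay("Steady state", data=["Validation data"]),
        ]),
    ],
)

print(investigation.assays_without_sop())
```

Writing the hierarchy out like this exposes gaps early, such as data that cannot yet be traced to the protocol that produced it.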
Use FAIR Data and Metadata Standards
help to improve understanding and exchange….
Credit: Nicolas Le Novère, Babraham Institute, UK, adapted.
represents
genetic designs
- standardized
vocabulary of
schematic glyphs
- standardized
digital format.
ICE, SBOLStack,
iGEM
CIMR Core Information for Metabolomics Reporting
MIABE Minimal Information About a Bioactive Entity
MIACA Minimal Information About a Cellular Assay
MIAME Minimum Information About a Microarray Experiment
MIAME/Nutr MIAME / Nutrigenomics
MIAME/Plant MIAME / Plant transcriptomics
MIAME/Tox MIAME / Toxicogenomics
MIAPA Minimum Information About a Phylogenetic Analysis
MIAPAR Minimum Information About a Protein Affinity Reagent
MIAPE Minimum Information About a Proteomics Experiment
MIARE Minimum Information About an RNAi Experiment
MIASE Minimum Information About a Simulation Experiment
Where do I go for standards
information?
Linking models….
• connecting (experimental/simulation) data to models
• connecting the single standards?
• interfacing between the different scales?
https://fairsharing.org/collection/FAIRDOM
How do I design the metadata?
Metadata ramps
Metadata
Registration and Use
Metadata ramps: spreadsheet templates
Tooling for annotations and checklist templates for different types of assay data.
Embed ontologies into
Excel templates
Excel spreadsheets enriched
with ontology annotations
Upload, extract metadata and register
http://www.rightfield.org.uk
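The template idea can be illustrated with a small check of spreadsheet cells against a controlled vocabulary. This is a sketch under stated assumptions, not RightField's implementation or API: it uses CSV text instead of Excel to stay within the standard library, and the column names, allowed terms and sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical controlled vocabularies, one per annotated column.
ALLOWED = {
    "organism": {"Plasmodium falciparum", "Escherichia coli"},
    "unit": {"mM", "uM", "U/mg"},
}


def check_rows(csv_text, allowed):
    """Return (row_number, column, value) for each cell outside its vocabulary."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        for column, terms in allowed.items():
            value = (row.get(column) or "").strip()
            if value not in terms:
                problems.append((row_number, column, value))
    return problems


# Hypothetical sheet with two cells that do not use the agreed terms.
sheet = """sample_id,organism,unit,value
S1,Plasmodium falciparum,mM,0.5
S2,P. falciparum,mM,0.7
S3,Escherichia coli,millimolar,1.2
"""

problems = check_rows(sheet, ALLOWED)
for row_number, column, value in problems:
    print(f"row {row_number}: '{value}' is not an allowed {column} term")
```

A template that restricts cells to the agreed terms at entry time avoids this clean-up step altogether, which is the point of embedding the vocabulary in the spreadsheet.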
Ramping up Samples
Spreadsheets! A new framework for Syn and Sys Bio
Samples are Inputs and Outputs….
compliant
Sounds
hard….
what can
I do?
12 steps to being FAIR
plan to be born FAIR
1. plan data management lifecycle: plan, cost and implement pathways and
storage including what you will archive, what you will throw away, how you
will collect metadata and how you will curate throughout
2. use standard identifiers and identifier standards
3. use metadata standards with data provenance
4. catalogue / register data with metadata
5. have access and sharing policies with licenses
6. use data (assets) management platforms and tools that work together
7. deposit into public archives
8. have a sustainability / end project plan
9. resource and support, and that also means people too
10. embed data management into work practices and do some training
11. give credit
12. check if you have sensitive data issues
What can you do?
• Make a Data Management Plan (check the checklist).
• Get an account on the FAIRDOMHub or install your own.
• Define and share your SOPs.
• Who is your group’s data steward?
• How are they getting credit?
• Know your local data management policies and resources.
• Get some training.
• Educate your supervisors, institutions and peers.
• Build some metadata ramps
The Data Steward
function, profession, cultural shift
• 500,000 needed in Europe*
• Specialist skills
• Career pathways
• Recognition
Curation and management
• Supported, Resourced
• Recognised, Rewarded
Sharing policy and practice embedded
* Realising the European Open Science Cloud (2016)
Jon Olav Vik,
Norwegian University of Life Sciences
Maksim Zakhartsev
University of Hohenheim, Stuttgart,
Germany
Alexey Kolodkin
Siberian Branch
Russian Academy of Sciences
Tomasz Zieliński,
SynthSys Centre
University of Edinburgh, UK
Martin Peters, Martin Scharm
Systems Biology Bioinformatics
University of Rostock, Germany
Reading List
• Wolstencroft et al (2016). “FAIRDOMHub: a repository and collaboration environment
for sharing systems biology research”. Nucleic Acids Research, 45(D1): D404-D407.
DOI: 10.1093/nar/gkw1032
• Rice and Southall, The Data Librarian's Handbook, Wiley Publishing, 2016
• Stanford et al, The evolution of standards and data management practices in systems
biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
• Wilkinson et al, The FAIR Guiding Principles for scientific data management and
stewardship, https://www.nature.com/articles/sdata201618 (2016)
• McMurry, Juty, et al. (2017) Identifiers for the 21st century: How to design, provision,
and reuse persistent identifiers to maximize utility and impact of life science data.
PLoS Biol 15(6): e2001414. https://doi.org/10.1371/journal.pbio.2001414
• Fenner et al, A Data Citation Roadmap for Scholarly Data Repositories doi:
https://doi.org/10.1101/097196
• Realising the European Open Science Cloud
https://ec.europa.eu/research/openscience/pdf/realising_the_european_open_science_cloud_2016.pdf
Website list
• FAIRDOM http://www.fair-dom.org
• FAIRDOMHub http://www.fairdomhub.org
• Rightfield http://www.rightfield.org.uk
• FAIRSharing http://www.fairsharing.org
• ELIXIR http://www.elixir-europe.org
• Software Carpentry https://software-carpentry.org/
• DataCarpentry http://www.datacarpentry.org/
• Sandbox https://sandbox1.fairdomhub.org
• empty box for safe playing
• copy the investigation that is there
• add your name to the guest list so we don’t double up -
http://tinyurl.com/sandboxlist

Being FAIR: FAIR data and model management SSBSS 2017 Summer School

  • 1.
    Being FAIR: FAIR dataand model management Professor Carole Goble, carole.goble@manchester.ac.uk The University of Manchester, UK The FAIRDOM Association Coordinator ELIXIR-UK Head of Node Co-lead ELIXIR Interoperability Platform SSBSS 2017, July 17 2017, Cambridge, UK 4th International Synthetic & Systems Biology Summer School
  • 2.
    Data-driven and predictivebiology Data, Software, Models, SOPs….MATTER Not a by-product. It’s the fuel. The assets. modellers experimentalists
  • 3.
  • 4.
    SystemsApproach: Context +more than Data models, data, SOPs, samples, strains, publications…. multiple, interrelated assets. multiple, dispersed repositories Multiple omics: genomics, transcriptomics proteomics, metabolomics, fluxomics, reactomics Images, molecular biology, reaction kinetics… SOPs, sample and strain metadata… Models: Metabolic, gene network, kinetic… Scripts and workflows The relationships between… Tracking: versions, provenance, parameters… Citation and credit… Standards fairsharing.org
  • 5.
    More than simplesupplementary materials 16 datafiles (kinetic, flux inhibition, runout) 19 models (kinetics, validation) 13 SOPs 3 studies (model analysis, construction, validation) 24 assays/analyses (simulations, model characterisations) Penkler, G., du Toit, F., Adams, W., Rautenbach, M., Palm, D. C., van Niekerk, D. D. and Snoep, J. L. (2015), Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum. FEBS J, 282: 1481–1511. doi:10.1111/febs.13237
  • 6.
    SyntheticApproach - Automation •Automate data management – Spreadsheets, instruments, LIMS… – Replication, comparison • Support automation – Tracking successful products from plasmids – Informing robots – Incorporate into pipelines and workflows – Mediate through samples – Standards [Courtesy: Andrew Millar]
  • 7.
    Systems Approach…Collaboration teams, disciplines,partners What methods are been used to determine enzyme activity? What SOP was used for this sample? Where is the validation data for this model? Is there any group generating kinetic data? Is this data available? Track versions of my model Whats the relationship between the data and model? Which data belong to which publications?
  • 8.
  • 9.
    End to endManagement Project Boot up, Run andWashup • Capture • Track • Organise & Link • Curate • Report • Exchange • Retain • Integrate • Reuse other systems • Support data-driven processes CREATING DATA PROCESSING DATA ANALYSING DATA PRESERVING DATA ACCESS TO DATA RE-USING DATA
  • 10.
    The FAIR GuidingPrinciples for scientific data management and stewardship https://www.nature.com/articles/sdata201618 (2016)
  • 11.
    The greater good…. Accessto public funded research, Reproducible results Value and cite all research outputs https://www.nature.com/articles/sdata201618 (2016)
  • 12.
    UK Funder DataPolicies http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies Compliance and Policy Data Management Plans
  • 13.
    https://wellcomeopenresearch.org/ Nature ScientificData Data (and software) as a first class citizen Data (and software) Citation Scholarly Communications Providers
  • 14.
    The Personal good…. •reviewers want additional work • statistician wants more runs • analysis needs to be repeated • post-doc leaves, • student arrives • new/revised datasets • updated/new versions of algorithms/codes • sample was contaminated • better kit - longer simulations • new partners, new projects Personal & Lab Productivity Sharing Reproducibility
  • 15.
  • 16.
    Standards are notalways used.... Formats MetadataMetadata reporting guidelines Ontologies *top three most popular The evolution of standards and data management practices in systems biology (2015). Stanford et al, Molecular Systems Biology, 11(12):851
  • 17.
    … model reuseand reproducibility tricky… Stanford et alThe evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
  • 18.
    Catalogues, Storage andPublishing Active Data, Published Data local/project LIMS, data management, analytics. active data. global, public, central subject-specific databases published data. ACT LOCALTHINK GLOBAL
  • 19.
    Cloud services figshare zenodo Amazon WebServices Google Cloud Azure, EBI Embassy Cloud Own cloud FAIRDOMHub mendeley data Cloud Data Services Cloud hosting services OpenAIRE
  • 20.
    Catalogues, Storage andPublishing Active Data, Published Data Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
  • 21.
    Catalogues, Storage andPublishing Active Data, Published Data Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053 Type specific archives Fragmented silos
  • 22.
    Catalogues, Storage andPublishing Active Data, Published Data Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053 Type specific archives Fragmented silos Experimental context All together
  • 23.
    Catalogues, Storage andPublishing Active Data, Published Data Central Repository Centric Research Infrastructure for FAIR Data for Life Sciences in Europe Top down, 21 National Nodes + EMBL-EBI Project Centric FAIR Research Data Management for Data, SOPs, Models for Systems and Synthetic Biology Projects Grass roots Association of Institutions and members funded by 4 EU countries. http://www.fair-dom.org http://www.elixir-europe.org
  • 24.
  • 25.
    FAIRDOM Consortium Since 2008…. ERANetsand ERACoFunds National Programmes National Centres EU Research Infrastructures Sponsors:
  • 26.
    Built by ProjectPALs post docs, postgrads and techs
  • 27.
    FAIRDOM FAIRDOM Software Platform+Tools ACentral Public Hub for Projects Customised Project Installations Project Stewardship Consultancy Services Community Activities 80 Projects 30+ Installations
  • 28.
  • 29.
    FAIR Checklists Making DataFindable (documentation and metadata management) • What documentation and metadata will accompany the data (assist its discoverability)? (Details on methodology, definitions, procedures, SOPs, vocabularies, units, dependencies, etc) • What information is needed for the data to be read and interpreted in the future? • What naming conventions will be used? • How will you approach versioning your data? • How will you capture / create this documentation and metadata? • How do you ensure the completeness of the captured data? Making Data Accessible Specify which data will be made openly available taking into consideration • What ethics and legal compliance issues do you have if any? Do you need consent for data preservation and sharing? Do you have to protect certain data? Is any data sensitive? • Do you think you might have Intellectual Property Rights issues? Have you considered ownership of the data, licensing, restrictions on use? • Do you think you will need to embargo any data? • How will you make the data available? (consider the platforms you will use: databases, repositories, etc) • What methods or software tools are needed to access the data? shoudl you include documentation detailing how to access use/access the software that is needed for accessing the data? Is it possible to include this software with the data (e.g. source code, docker etc) • If there are any restrictions on accessibility, how will you provide access? Making Data Interoperable • What standards (metadata vocabularies, formats, checklists) or methodologies will you use? • How do you address data and model quality?What validation steps do you foresee? • Will you use standardised vocabulary for all data types to allow inter-disciplinary interoperability? • Where you can not used standardised vocabulary for all types of data, can you map to more commonly used ontologies? Making Data Re-usable • How will you licence your data to permit the widest re- use possible? 
• When will the data be made available for re-use? Does this include an embargo period? (if so, why?) • Which data will be available for re-use during/after the project? If not, why? • What are your data quality assurance processes? • How long do you expect your data to remain re-usable?
  • 30.
  • 31.
    Stewardship Support 500K needed*,a new career needing a career path *European Open Science Cloud Report
  • 32.
    FAIRDOM Platform Free andOpen Source Front end Project(s) Hub Back end Onsite storage & analytics On site Tracking, data analytic pipelines, Extract,Transform and Load direct from the instruments, large data management LIMS, auto-archiving Web-based portal Project controlled spaces Metadata catalogue &Yellow pages Results repository, dissemination and collaboration Tool gateway Built using Built using
  • 33.
    Back end Instrument DataManagement, LIMS, ELN Samples Protocols Experiment Description Raw Data Analysis Scripts Results Laboratory Notebook & Inventory Manager ELN LIMS-like linking data to biological materials • samples+protocols management • data management • experimental description Big Data analytics on distributed compute resources
  • 34.
    • Project controlledprotected spaces – Working space, show space for results – Supp. materials space for publications – Yellow pages and collaboration – Upload or link to data • One place catalogue – Regardless of physical store – Organised is ISA with shared metadata – Standards-compliant • Linked with other systems – Project on-site (secure) repositories – Public deposition archives – Integrated with JWSOnline modelling tools Front End Hub: A Commons one place to Find, Access and organise assets “Using FAIRDOMHub my own lab colleagues saw what I was doing and called to collaborate!”
  • 35.
    859 people 80 projects 198institutions FAIRDOMHub.org Public Commons self managed workspaces, controlled sharing, shared metadata yellow pages
  • 36.
    More than simplesupplementary materials 16 datafiles (kinetic, flux inhibition, runout) 19 models (kinetics, validation) 13 SOPs 3 studies (model analysis, construction, validation) 24 assays/analyses (simulations, model characterisations) Penkler, G., du Toit, F., Adams, W., Rautenbach, M., Palm, D. C., van Niekerk, D. D. and Snoep, J. L. (2015), Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum. FEBS J, 282: 1481–1511. doi:10.1111/febs.13237
  • 37.
  • 38.
    Catalogue across repositoriesregardless of location federated stores retaining context to support decision making and reuse bridging local and global In House Stores External Databases Publishing services Secure Stores Model Resources Upload or Reference
  • 39.
    Protected spaces, sharing sensitivities Open science applies to you but not me… not available, not citable. Licenses Negotiated access Embargos Permission controls Staged sharing
  • 40.
    Act Local Think Global Cloud Service .org Local retention In flight management, Private sharing Customisation Centres, large projects National projects Local skills for admin support Post-project retention One stop showcase Self-managed sharing Supplementary materials Off-the-shelf features Hosted on behalf of users Delegated admin support Long term repository • Trusted repository • Guaranteed until 2029 • Long term maintenance • Sustainability • 1TB per project stored centrally. • Much more catalogued.
  • 41.
    Hub common space, one place to organise and report your assets .org Nucl. Acids Res. (2016) doi: 10.1093/nar/gkw1032 70+ Projects 30+ Installations Public & cloud Subject and Datatype archives
  • 42.
    Typical Data Flows HTP data processing management exchange deposition publishing reporting ORGANISATION COMMUNICATION samples analytics models, SOPs processed data DISSEMINATION Less data, more metadata, potentially wider access processed data
  • 43.
    Publishing… snapshot and assign DOIs Credits and Citations G. Penkler, F. Du Toit, W. Adams, M. Rautenbach, D. C. Palm, D. D. Van Niekerk, & J. L. Snoep. (2014). Glucose metabolism in Plasmodium falciparum trophozoites. FAIRDOMHub. http://doi.org/10.15490/seek.1.investigation.56 Snapshot to fix state with particular versions Assign a DOI Entry has citation metadata Use in journals and in metrics systems Active entry continues to evolve Fenner et al, A Data Citation Roadmap for Scholarly Data Repositories doi: https://doi.org/10.1101/097196
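The snapshot idea on this slide (fix the state of an entry at particular asset versions, attach a DOI, let the active entry keep evolving) can be sketched in a few lines. This is only an illustration, not how FAIRDOMHub implements it: the class names, the `take_snapshot` helper, the asset names and the version numbers are all hypothetical; only the title and DOI come from the slide.

```python
from dataclasses import dataclass, field
from types import MappingProxyType


@dataclass
class Entry:
    """An active entry whose assets keep evolving (hypothetical structure)."""
    title: str
    asset_versions: dict = field(default_factory=dict)  # asset name -> current version


@dataclass(frozen=True)
class Snapshot:
    """A fixed state of an entry, citable through a DOI."""
    title: str
    asset_versions: MappingProxyType  # read-only view of the versions at snapshot time
    doi: str


def take_snapshot(entry: Entry, doi: str) -> Snapshot:
    # Copy the version map so later edits to the entry do not leak into the snapshot.
    frozen_versions = MappingProxyType(dict(entry.asset_versions))
    return Snapshot(entry.title, frozen_versions, doi)


# Asset names and version numbers below are made up for illustration.
entry = Entry(
    "Glucose metabolism in Plasmodium falciparum trophozoites",
    {"HK model": 1, "HK data": 2},
)
snap = take_snapshot(entry, "10.15490/seek.1.investigation.56")

# The active entry continues to evolve; the snapshot stays as it was.
entry.asset_versions["HK model"] = 2
```

The point of copying and wrapping the version map is that the cited snapshot cannot drift when the live entry is updated.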
  • 44.
    An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”. Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”. Ottoline Leyser […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”. http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
  • 45.
    Retention: Moses from the ERANET SysMO Programme Project ended in 2010 Publication in 2014/2015 Using data from 2012 [Maxim Zakhartsev]
  • 46.
    [Adapted from Ursula Klingmüller, Martin Böhm] Excemplify Antibody Database FAIR collaboration from the ERANet ERASysAPP
  • 47.
    Programme: overarching research theme (The Digital Salmon). Project: research grant (DigiSal, GenoSysFat). Investigation: a particular biological process, phenomenon or thing (typically corresponds to [plans for] one or more closely related papers). Study: experiment whose design reflects a specific biological research question. Assay: standardized measurement or diagnostic experiment using a specific protocol (applied to material from a study). Jon Olav Vik, Norwegian University of Life Sciences. Integration with Norway’s national e-infrastructure for Life Science (NeLS)
  • 48.
  • 49.
  • 50.
    Ubiquitous Spreadsheet • Unifying processes • Common spreadsheet models – Consistency and quality of collaboration – Common identifier meanings – Metadata collection Tracking in Specialist Systems
  • 51.
    http://www.fairdomhub.org https://sandbox1.fairdomhub.org • empty box for safe playing • copy the investigation that is there • add your name to the guest list so we don’t double up - http://tinyurl.com/sandboxlist Try out for yourself…
  • 52.
    The first steps? • Metadata design • Samples – The link between everything • The ubiquitous spreadsheet – Templates and exchange… – Unifying processes – Carrying best practice Image from FAIRSharing.org
  • 53.
    Use and reuse standard identifiers General standards Site specific Community standards e.g. SynBioChem ICE Strain convention A URL, preferably to identifiers.org, that resolves to the description of the host strain in NCBI taxonomy e.g. E. coli DH5α http://identifiers.org/taxonomy/668369 location independent resolvable identifiers (URIs) decoupling the identification of records from their physical locations
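As a minimal sketch of the convention on this slide, a resolvable identifier can be assembled from a collection prefix and an accession, so that records store the URI rather than a site-specific location. The function name `identifiers_org_uri` is hypothetical; the URL pattern and the example accession are taken from the slide.

```python
def identifiers_org_uri(prefix: str, accession: str) -> str:
    """Build an identifiers.org URI following the pattern shown on the slide."""
    prefix = prefix.strip().lower()
    accession = accession.strip()
    if not prefix or not accession:
        raise ValueError("both a collection prefix and an accession are required")
    return f"http://identifiers.org/{prefix}/{accession}"


# Host strain from the slide: E. coli DH5α in NCBI taxonomy.
strain_uri = identifiers_org_uri("taxonomy", "668369")
print(strain_uri)
```

Storing the URI instead of a local database key is what decouples the identification of a record from where it physically lives.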
  • 54.
    Investigation: Glucose metabolism in P. falciparum trophozoites Study: Model construction Study: Model validation Assay: LDH Assay: PK Assay: ENO Assay: PGM Assay: PGK Assay: GAPDH Assay: TPI Assay: ALD Assay: PFK Assay: PGI Assay: HK Assay: GLCtr Assay: PYRtr Assay: LACtr Assay: G3PDH Assay: GLYtr Assay: ATPase Data: GLCtr Model: GLCtr Data: HK Model: HK Steady state Incubation penkler1 Validation data penkler2 Validation data ... ... SOP: GLCtr SOP: HK ... SOP: Validation Assay: Culturing Assay: Lysate prep. SOP: Culturing SOP: Lysate prep. Design an ISA (Investigation, Study, Assay/Analysis) structure. Devising this makes you think…..
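One way to make the ISA structure concrete is to write it down as nested records before any data exists. The sketch below uses plain Python dataclasses, not the FAIRDOMHub/SEEK data model. The class layout is an assumption, and so is the nesting of the HK and GLCtr assays under the "Model construction" study, which is only my reading of the flattened diagram.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Assay:
    name: str
    data: List[str] = field(default_factory=list)
    models: List[str] = field(default_factory=list)
    sops: List[str] = field(default_factory=list)


@dataclass
class Study:
    name: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)

    def assets(self) -> List[Tuple[str, str, str, str]]:
        """Flatten the tree into (study, assay, kind, asset) rows."""
        rows = []
        for study in self.studies:
            for assay in study.assays:
                for kind, items in (("data", assay.data),
                                    ("model", assay.models),
                                    ("sop", assay.sops)):
                    rows.extend((study.name, assay.name, kind, item) for item in items)
        return rows


# Only two of the assays from the slide are shown; their placement is assumed.
investigation = Investigation(
    "Glucose metabolism in P. falciparum trophozoites",
    [
        Study("Model construction", [
            Assay("HK", data=["Data: HK"], models=["Model: HK"], sops=["SOP: HK"]),
            Assay("GLCtr", data=["Data: GLCtr"], models=["Model: GLCtr"], sops=["SOP: GLCtr"]),
        ]),
        Study("Model validation"),
    ],
)

for row in investigation.assets():
    print(row)
```

Writing the tree out this way forces the decisions the slide alludes to: which assay each data file, model and SOP belongs to, and which study each assay serves.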
  • 55.
    Use FAIR Data and Metadata Standards help to improve understanding and exchange…. Credit: Nicolas Le Novère, Babraham Institute, UK, adapted. represents genetic designs - standardized vocabulary of schematic glyphs - standardized digital format. ICE, SBOLStack, iGEM CIMR: Core Information for Metabolomics Reporting; MIABE: Minimal Information About a Bioactive Entity; MIACA: Minimal Information About a Cellular Assay; MIAME: Minimum Information About a Microarray Experiment; MIAME/Nutr: MIAME / Nutrigenomics; MIAME/Plant: MIAME / Plant transcriptomics; MIAME/Tox: MIAME / Toxicogenomics; MIAPA: Minimum Information About a Phylogenetic Analysis; MIAPAR: Minimum Information About a Protein Affinity Reagent; MIAPE: Minimum Information About a Proteomics Experiment; MIARE: Minimum Information About a RNAi Experiment; MIASE: Minimum Information About a Simulation Experiment
  • 56.
    Where do I go for standards information? Linking models…. • connecting (experimental/simulation) data to models • connecting the single standards? • interfacing between the different scales? https://fairsharing.org/collection/FAIRDOM
  • 57.
    How do I design the metadata? Metadata ramps Metadata Registration and Use
  • 58.
    Metadata ramps: spreadsheet templates Tooling for annotations and checklist templates for different types of assay data. Embed ontologies into Excel templates Excel spreadsheets enriched with ontology annotations Upload, extract metadata and register http://www.rightfield.org.uk
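The effect of a template with embedded vocabulary terms can be approximated by checking a metadata column against an allowed list before upload. The sketch below is not RightField and does not read Excel files. It uses a CSV string and Python's standard `csv` module, and the column name, the allowed terms and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical controlled vocabulary for one template column.
ALLOWED_ASSAY_TYPES = {"enzyme kinetics", "metabolite profiling"}


def check_column(csv_text: str, column: str, allowed: set) -> list:
    """Return (row_number, value) for every cell whose value is not an allowed term."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows start at 2 (as in a spreadsheet).
    for row_number, row in enumerate(reader, start=2):
        value = (row.get(column) or "").strip()
        if value not in allowed:
            problems.append((row_number, value))
    return problems


sheet = (
    "sample_id,assay_type\n"
    "S1,enzyme kinetics\n"
    "S2,kinetics\n"
    "S3,\n"
)
problems = check_column(sheet, "assay_type", ALLOWED_ASSAY_TYPES)
print(problems)
```

A template that constrains the cell at entry time avoids this after-the-fact check, which is the argument for building the vocabulary into the spreadsheet itself.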
  • 59.
    Ramping up Samples Spreadsheets! A new framework for Syn and Sys Bio Samples are Inputs and Outputs…. compliant
  • 60.
    Sounds hard…. what can I do? 12 steps to being FAIR plan to be born FAIR 1. plan data management lifecycle: plan, cost and implement pathways and storage including what you will archive, what you will throw away, how you will collect metadata and how you will curate throughout 2. use standard identifiers and identifier standards 3. use metadata standards with data provenance 4. catalogue / register data with metadata 5. have access and sharing policies with licenses 6. use data (assets) management platforms and tools that work together 7. deposit into public archives 8. have a sustainability / end project plan 9. resource and support, and that also means people too 10. embed data management into work practices and do some training 11. give credit 12. check if you have sensitive data issues
  • 61.
    What can you do? • Make a Data Management Plan (check the checklist). • Get an account on the FAIRDOMHub or install your own. • Define and share your SOPs. • Who is your group’s data steward? • How are they getting credit? • Know your local data management policies and resources. • Get some training. • Educate your supervisors, institutions and peers. • Build some metadata ramps
  • 62.
    The Data Steward function, profession, cultural shift • 500,000 needed in Europe* • Specialist skills • Career pathways • Recognition Curation and management • Supported, Resourced • Recognised, Rewarded Sharing policy and practice embedded * Realising the European Open Science Cloud (2016)
  • 63.
    Jon Olav Vik, Norwegian University of Life Sciences Maksim Zakhartsev, University Hohenheim, Stuttgart, Germany Alexey Kolodkin, Siberian Branch Russian Academy of Sciences Tomasz Zieliński, SynthSys Centre, University Edinburgh, UK Martin Peters, Martin Scharm, Systems Biology Bioinformatics, University of Rostock, Germany
  • 64.
    Reading List • Wolstencroft et al (2016). “FAIRDOMHub: a repository and collaboration environment for sharing systems biology research”. Nucleic Acids Research, 45(D1): D404-D407. DOI: 10.1093/nar/gkw1032 • Rice and Southall, The Data Librarian's Handbook, Wiley Publishing, 2016 • Stanford et al, The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053 • Wilkinson et al, The FAIR Guiding Principles for scientific data management and stewardship, https://www.nature.com/articles/sdata201618 (2016) • McMurry, Juty, et al. (2017) Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol 15(6): e2001414. https://doi.org/10.1371/journal.pbio.2001414 • Fenner et al, A Data Citation Roadmap for Scholarly Data Repositories doi: https://doi.org/10.1101/097196 • Realising the European Open Science Cloud https://ec.europa.eu/research/openscience/pdf/realising_the_european_open_science_cloud_2016.pdf
  • 65.
    Website list • FAIRDOM http://www.fair-dom.org • FAIRDOMHub http://www.fairdomhub.org • Rightfield http://www.rightfield.org.uk • FAIRSharing http://www.fairsharing.org • ELIXIR http://www.elixir-europe.org • Software Carpentry https://software-carpentry.org/ • Data Carpentry http://www.datacarpentry.org/ • Sandbox https://sandbox1.fairdomhub.org • empty box for safe playing • copy the investigation that is there • add your name to the guest list so we don’t double up - http://tinyurl.com/sandboxlist