FAIR data for humans and machines:
The FAIRplus project
Susanna-Assunta Sansone
ORCiD: 0000-0001-5306-5690 | Twitter: @SusannaASansone
datareadiness.eng.ox.ac.uk
Associate Professor, Information Engineering
Associate Director, Oxford e-Research Centre
MCBIOS & MAQC joint virtual conference, 29-30 April 2021
Slides: https://www.slideshare.net/SusannaSansone
Discoveries are made using shared data and this requires data that are:
• Retrievable and structured in standard format(s)
• Self-described so that third parties can make sense of them
The problem
Forbes article on 2016 Data Scientist Report
https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#276a35e6f637
Data preparation accounts for about 80% of the work of data scientists
A set of principles to enhance the
value of all digital resources and their
reuse by humans and machines
Data that is discoverable and reusable at scale
Findable
Accessible
Interoperable
Reusable
• Globally unique, resolvable, and persistent identifiers
▪ To retrieve and connect data
• Community defined descriptive metadata
▪ To enhance discoverability
• Common terminologies
▪ So that the same term means the same thing
• Detailed provenance
▪ To contextualize the data and facilitate reproducibility
• Terms of access
▪ Open as possible, closed as necessary
• Terms of use
▪ Clear licences, ideally to enable innovation and reuse
The FAIR Principles in a nutshell
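The six elements listed above can be read as a checklist for a dataset description. The following is a minimal sketch of that idea, not an official FAIR assessment: the `DatasetRecord` class, its field names and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetRecord:
    """Hypothetical dataset description; field names are illustrative only."""
    identifier: str = ""                        # globally unique, resolvable, persistent ID
    metadata: dict = field(default_factory=dict)  # community-defined descriptive metadata
    terminology_terms: list = field(default_factory=list)  # terms from common terminologies
    provenance: str = ""                        # how the data were produced
    access_terms: str = ""                      # terms of access
    licence: str = ""                           # terms of use


def missing_elements(record):
    """Return the names of the elements that are still empty in the record."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Example values are placeholders, not a real dataset.
record = DatasetRecord(identifier="https://example.org/dataset/1", licence="CC-BY-4.0")
print(missing_elements(record))
```

A check like this only shows whether each element is present; it says nothing about its quality, for example whether the identifier actually resolves.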
Findable
Accessible
Interoperable
Reusable
Providing for a continuum of features,
attributes and behaviours
FAIR guiding principles
FAIR: just principles, not practice
The scholarly publishing
ecosystem is changing
Data-related mandates by
funders and institutions are
growing
Researchers need
recognition and credit
Human-machine collaboration is the future
theconversation.com/how-robots-can-help-us-embrace-a-more-human-view-of-disability-76815
Reproducibility of published studies is still problematic
• 21% pharmacology data (doi.org/10.1038/nrd3439-c1)
• 11% cancer data (doi.org/10.1038/483531a)
• unsatisfactory in ML (openreview.net/pdf?id=By4l2PbQ-)
towardsdatascience.com/scientific-data-analysis-pipelines-and-reproducibility-75ff9df5b4c5
Responding to needs and crisis
doi.org/10.2777/986252
www.gov.uk/government/publications/open-research-data-task-force-final-report
www.turing.ac.uk/research/impact-stories/changing-culture-data-science
www.fair-access.net.au
doi.org/10.1787/25186167
ark:/48223/pf0000374837
FAIR has aligned the broad community
around common guidelines
doi.org/10.7486/DRI.tq582c863
doi.org/10.2777/02999
The cost of not having FAIR research data
Impact on innovation
Credit to:
A crowded space, examples of European projects
Examples:
A growing number of metrics, indicators,
certifications of FAIRness
Diversity of methods and opinions:
• Metrics and indicators
• Automated and manual
Define Implement Embed & Sustain
Concepts for FAIR
implementation
FAIR culture
FAIR
ecosystem
Skills for FAIR
Incentives and
metrics for FAIR data
and services
Investment in
FAIR
Economic Technical Social Political
doi.org/10.2777/1524
Making FAIR a reality in the research ecosystem
My fair share
of the work
€5.3 billion
programme
European
intergovernmental
organization
23 member
countries and over
220 research
organizations
Since 2014
1
2
3 Started in 2019
Since 2014, several programs:
2014-2017
2017-2018
Example of FAIR-enabling programs and projects
Since 2009
Pre-competitive collaboration between over
100 global organizations, SMEs and academia
The way biopharma works has changed
Today & tomorrow
Proprietary
content
providers
Public
content
providers
Academic
groups
Software vendors
CROs
Service providers
Regulatory
authorities
Exemplar initiatives:
The rise of pre-competitive initiatives around data
• Biopharma R&D productivity can be improved
by implementing the FAIR Principles
• FAIR enables powerful new AI analytics to
access data for machine learning and prediction
• Requirements
▪ financial, technical, training
• Challenges
▪ change the culture, show business value, achieve ‘FAIR enough’ on an enterprise scale
FAIR as enabler for the digital transformation
● No practical advice on how to do FAIRification
● Can’t tell how FAIR I am already
● Don’t know how to become more FAIR
● My organisation doesn’t care anyway
Deliver the FAIR Cookbook
Mature FAIRification processes & build a maturity model
Assess FAIR levels of projects and data
Change data management culture
Objectives and outputs of the FAIRplus project
Defined by using a number of IMI health research and innovation projects
The FAIRification process
20%
identifiers
80%
metadata
https://doi.org/10.2777/1524
Two pillars of FAIR
Findable
Accessible
Interoperable
Reusable
Different contexts mandate different strategies
Molecular data
Clinical (observation based) data Clinical trial (event based) data
FAIRification paths in the context of IMI projects
Molecular data
Clinical (observation based) data Clinical trial (event based) data
Selecting standard stacks for the FAIRification
Clinical (observation based) data Clinical trial (event based) data
Selecting standard stacks for the FAIRification
Terminology
Format
Checklists
Molecular data
Selecting standard stacks for the FAIRification
COMMUNITY STANDARDS for metadata and identifiers
standard organizations, grass-roots groups
Counts shown on the slide: 390+, 162+, 729+, ~1300, 13
Guidelines: MIAME, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT, …
Formats: SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA, CML, MITAB, …
Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO, …
Identifiers: EC number, URL, PURL, LSID, Handle, ORCID, RRID, InChI, IVOA ID, DOI, …
An informative and educational resource covering:
• DATA & METADATA STANDARDS
• REPOSITORIES: databases and knowledgebases
• DATA POLICIES by funders, journals and other organizations
Guides consumers to discover, select and use these resources with confidence
Helps producers to make their resources more visible, more widely adopted and cited
Provides curated, community-vetted descriptions and knowledge graphs that represent these resources and their inter-relationships
… 232 standards
https://fairsharing.org/collection/ISOCD20691CollectionDRAFT
ISO/CD 20691 is a specification that details the requirements for
the consistent formatting and documentation of data and
metadata in the life sciences and biotechnology, including
biomedical research and non-human biological research and
development; it covers manual or computational workflows.
This FAIRsharing Collection includes the standards detailed in
the ISO/CD 20691 specification, and serves as a 'live' list to
search and discover these standards, their use by repositories,
as well as their evolution over time.
Publishing the FAIRified data and turning knowledge
into recipes
https://fairplus.github.io/the-fair-cookbook
● A comprehensive resource collating ‘recipes’ for making different
types of data FAIR
● Work in progress: currently 50 recipes and growing….
What is it?
FAIRification processes
● How to FAIRify or improve the FAIRness of exemplar datasets
● Which are the levels and indicators of FAIRness
● Which open source technologies, tools and services are available
● What skills are required
● Awareness of known challenges
Learning outcomes
https://doi.org/10.1038/s41597-019-0286-0
Example of converting
Excel files to a Frictionless
Data Package, and then
building a semantic model to
produce a linked data graph
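To illustrate the first half of that example, here is a minimal sketch of building a Data Package descriptor for one table. It is not the recipe from the Cookbook: it assumes the spreadsheet has already been exported to CSV, uses only the Python standard library, and the helper names (`infer_type`, `build_descriptor`), the package name and the sample rows are hypothetical. The semantic modelling step is not covered.

```python
import csv
import io
import json


def infer_type(values):
    """Guess a Table Schema field type from string cell values."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if values and all_parse(int):
        return "integer"
    if values and all_parse(float):
        return "number"
    return "string"


def build_descriptor(name, path, csv_text):
    """Build a Data Package descriptor (as a dict) for a single CSV resource."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    schema_fields = [
        {"name": column, "type": infer_type([row[column] for row in rows])}
        for column in reader.fieldnames
    ]
    return {
        "name": name,
        "resources": [
            {"name": name, "path": path, "schema": {"fields": schema_fields}}
        ],
    }


# Made-up sample table standing in for one exported spreadsheet sheet.
csv_text = "sample_id,organism,dose_mg\nS1,Homo sapiens,10\nS2,Mus musculus,2.5\n"
descriptor = build_descriptor("example-samples", "samples.csv", csv_text)
print(json.dumps(descriptor, indent=2))
```

The printed JSON is what would be saved as `datapackage.json` next to `samples.csv`. The type inference here is deliberately crude; a real conversion would also need to handle dates, missing values and controlled vocabularies.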
The FAIR Cookbook
Overview
Capability Maturity
Model
FAIRsharing
cross linking
Biotools
cross linking
Tracking authors and their
contributions with the
CRediT vocabulary
Open license
Executable code as Jupyter notebook
https://fairtoolkit.pistoiaalliance.org
Partnership with
1. Ontologies
2. Standards
3. Versioning
4. Identifiers
5. Licensing
Top needs and challenges
FAIRification processes
FAIRification processes
● How to measure the FAIRness level of data?
○ For use in the FAIRification processes to define initial/final level of data FAIRness
● How to measure the capability and performance of an organization for FAIR data
generation and management?
○ For use at the strategy level to identify investment areas, monitor processes
○ E.g. ability to provide ETL capability, an ontology look-up service, or mapping services
FAIR indicators and capability maturity model
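One simple way to picture an initial and a final FAIRness level is as the share of indicators a dataset satisfies before and after FAIRification. The sketch below assumes exactly that; the indicator names are invented for illustration and are not the FAIRplus indicators, and a plain fraction is only one of many possible scoring schemes.

```python
def fairness_score(indicators):
    """Return the fraction of indicators that are satisfied (0.0 to 1.0)."""
    if not indicators:
        raise ValueError("at least one indicator is required")
    return sum(bool(passed) for passed in indicators.values()) / len(indicators)


# Hypothetical indicators for one dataset, before and after FAIRification.
initial = {
    "has_persistent_identifier": False,
    "uses_community_metadata": False,
    "uses_community_ontologies": False,
    "has_clear_licence": True,
}
final = {
    "has_persistent_identifier": True,
    "uses_community_metadata": True,
    "uses_community_ontologies": False,
    "has_clear_licence": True,
}

gain = fairness_score(final) - fairness_score(initial)
print(f"initial={fairness_score(initial):.2f} final={fairness_score(final):.2f} gain={gain:.2f}")
```

Keeping the per-indicator results, not just the score, makes it visible which capability (here, the use of community ontologies) still needs investment.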
The capability maturity model
Which capabilities are needed to
improve data reusability?
The optimum level of FAIRness
is a trade-off between desired
data reuse level and cost to
achieve that level
The capability maturity model - the ontology example
No use of ontologies
Use of internal ontologies
Use of community ontologies
+ Ontology service to manage several ontologies, mapping, versioning etc.
+ Term suggestion, automatic annotation, terms conflict resolution etc.
Robot recipe will
help to move
from Repeatable
to Defined level
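The ontology example can be sketched as an ordered set of levels, each with the capability it adds. The slide names only the Repeatable and Defined levels; the other level names below are borrowed from the classic capability maturity model, and the pairing of each level with one of the five ontology stages is an assumption made for illustration.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """Level names: only REPEATABLE and DEFINED appear on the slide; the rest are assumed."""
    INITIAL = 1
    REPEATABLE = 2
    DEFINED = 3
    MANAGED = 4
    OPTIMIZING = 5


# Assumed one-to-one pairing of levels with the ontology stages from the slide.
ONTOLOGY_CAPABILITY = {
    Maturity.INITIAL: "No use of ontologies",
    Maturity.REPEATABLE: "Use of internal ontologies",
    Maturity.DEFINED: "Use of community ontologies",
    Maturity.MANAGED: "Ontology service to manage several ontologies, mapping, versioning",
    Maturity.OPTIMIZING: "Term suggestion, automatic annotation, terms conflict resolution",
}


def next_step(current):
    """Return (next level, capability to acquire), or None at the top level."""
    if current == Maturity.OPTIMIZING:
        return None
    target = Maturity(current + 1)
    return target, ONTOLOGY_CAPABILITY[target]


level, capability = next_step(Maturity.REPEATABLE)
print(f"{level.name}: {capability}")
```

Under this assumed pairing, moving from Repeatable to Defined corresponds to replacing internal ontologies with community ones.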
● Practical advice on how to do FAIRification
● Exemplar FAIRified data
● Process and maturity model
● My organisation cares about FAIR data
Deliver the FAIR Cookbook
Mature FAIRification
Projects and datasets
Change data management culture
FAIRplus project - summary
The road to FAIRness
Before FAIR
After FAIR
…from chaos, comes order?
infrastructures
standards
tools
policies
education
training
cultural normalization
incentives
long term investment
It is not simple, but it is no longer optional
A FAIRY tale needs some magic
The Magic Roundabout in Swindon, England
datareadiness.eng.ox.ac.uk

FAIR, FAIRplus and the FAIR Cookbook
