Assessing FAIR data principles
against the 5-star Open data
principles
Ali Hasnain and Dietrich Rebholz-Schuhmann
June 2019
Outline
• Motivation
• Linked Open Data
• FAIR Data
• 5-Star Open Data and FAIR Data
• Conclusions
Motivation:
• Linked Open Data has been defined and designed
to join up data in the public space
• FAIR data principles have been defined to make
data reusable
• Both share principles of
– Interoperability of data, and
– Findability of data
• How do both principled approaches differ?
• How does LOD relate to FAIR?
Data driven world
The Linked Data Cloud
Heterogeneous Data – Multiple Data Sources
DrugBank
DailyMed
ChEBI, KEGG
Reactome
SIDER
BioPAX
Medicare
Size & type of data we talk about
Dataset | Category | Year* | Topic | Size/Coverage
DrugBank | LODD | 2010 | Drugs | 766920 triples, 4800 drugs
LinkedCT | LODD | X | Clinical Trials | 25 M triples, 106000 trials
DailyMed | LODD | 2010 | Drugs | 1604983 triples, >36K products
DBpedia | LODD | 2009 | Drugs/Diseases/Proteins | 218 M triples, 2300 drugs, 2200 proteins
Diseasome | LODD | 2010 | Diseases/Genes | 91182 triples, 2600 genes
SIDER | LODD | 2010 | Diseases/Side Effects | 192515 triples, 63K effects, 1737 genes
STITCH | LODD | 2010 | Chemicals/Proteins | 7.5 M chemicals, 0.5 M proteins
ChEMBL | LODD | 2010 | Assay/Proteins/Organisms | 130 M triples
Affymetrix | Bio2RDF | 2014 | Microarrays | 8694237 triples, 6679943 entities
BioModels | Bio2RDF | 2014 | Biological/mathematical models | 2380009 triples, 188308 entities
BioPortal | Bio2RDF | 2014 | Biological/biomedical entities | 19920395 triples, 2199594 entities
KEGG | Bio2RDF | 2014 | Genes | 50197150 triples, 6533307 entities
PharmGKB | Bio2RDF | 2014 | Genotypes/Phenotypes | 278049209 triples, 25325504 entities
PubMed | Bio2RDF | 2014 | Citations | 5005343905 triples, 412593720 entities
Taxonomy | Bio2RDF | 2014 | Taxonomy | 21310356 triples, 1147211 entities
LLD | LLD | 2014 | Drugs, Chromosomes | 10192641644 statements
*Statistics as of Aug 2016 (source: DataHub). Year specifies when the most recent version was
produced; “X” means the information is not available.
5 ★ OPEN DATA (I.E. THE LOD)
Source: http://5stardata.info/en/
Open data, open science and free software
• Open, in the sense of “not fenced in”:
– open library (“just walk in”), open society (“allow religious
diversity”), open skies (“allow fly-overs”), open university
(“studies for all”)
• Free software:
– different levels of use and reuse; free as in “free speech”,
– not as in “free beer” and not as in “free puppy”
• Creative Commons license parameters
– to govern copyright:
– Attribution – Non-commercial – Share-alike – No Derivatives
• In manufacturing:
– Open Design, Open Source Hardware
• Open means “open access”, e.g., to scholarly literature,
data, meta-data, but includes “free” and “public
access”
• Open science: open access (OA) to all parts of the scientific process,
but access without use?
03/06/2018 INSIGHT@NUIG
Open data, more specifically
• Open Education:
• Reuse of all data, for educational purposes
• Open Government: public access to governmental
data, achieving transparency and participation (Open
Data Governance Board)
• Open Data Institute: Breeding openness through
institutional structures, Insight is part of it
• Openness requires standards for more openness
• Agricultural openness: Open Source Seed Initiative,
others
• But also: “Open source Beer” and “Free Beer
Foundation”
Data Interoperability, Integration and Reuse is the Key
[Figure: RDF graph joining two sources. NCBI: nih:EGFR with nci:has_description
“epidermal growth factor receptor”, nih:organism “Homo sapiens”, nih:sequence
“CCCCGGCGCAGCGCGGCCGCAGCAGCCTCCGCCCCCCGCACGGTGTGAGCGCCCGACGCGGCCGAGGCGG …”
and nih:interacts nih:EGF (which also carries nih:organism). Reactome: rea:EGFR
with rea:keyword rea:Membrane, rea:Receptor and rea:Transferase. The two EGFR
nodes are connected by sameAs.]
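The idea behind this slide, that a sameAs link lets statements from two sources be read as statements about one entity, can be sketched in a few lines of Python. This is only an illustration: the triples are read off the slide labels (the exact pairing of subjects and predicates is an assumption), and `merge_same_as` is a hypothetical helper, not part of any RDF library. It resolves direct sameAs pairs only, not transitive chains.

```python
from collections import defaultdict

# Triples read off the slide; the subject/predicate pairing is an assumption.
NCBI = [
    ("nih:EGFR", "nci:has_description", "epidermal growth factor receptor"),
    ("nih:EGFR", "nih:organism", "Homo sapiens"),
    ("nih:EGFR", "nih:interacts", "nih:EGF"),
]
REACTOME = [
    ("rea:EGFR", "rea:keyword", "rea:Membrane"),
    ("rea:EGFR", "rea:keyword", "rea:Receptor"),
    ("rea:EGFR", "rea:keyword", "rea:Transferase"),
]
# (canonical name, alias) pairs, standing in for owl:sameAs statements.
SAME_AS = [("nih:EGFR", "rea:EGFR")]


def merge_same_as(triples, same_as):
    """Rewrite aliases to their canonical name, then group statements by subject."""
    canonical = {alias: target for target, alias in same_as}
    merged = defaultdict(set)
    for subject, predicate, obj in triples:
        merged[canonical.get(subject, subject)].add(
            (predicate, canonical.get(obj, obj))
        )
    return merged


merged = merge_same_as(NCBI + REACTOME, SAME_AS)
for predicate, obj in sorted(merged["nih:EGFR"]):
    print(predicate, "->", obj)
```

After merging, the Reactome keywords and the NCBI description sit under a single subject, which is the integration step the slide illustrates.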
FAIR Data
Findable
– (Meta)data is assigned a globally unique and persistent identifier
– Data is described with rich metadata
– (Meta)data is registered or indexed in a searchable resource
– (Meta)data specifies the data identifier
Accessible
– (Meta)data is retrieved by the identifier using a standardized protocol
– The protocol is open, free and universally implementable
– Protocol allows for an authentication and authorization procedure when
necessary
– The (meta)data remains accessible even when the data is not
Interoperable
– (Meta)data uses a formal, accessible, shared and broadly applicable language for
knowledge representation
– (Meta)data uses vocabularies that follow the FAIR principles
– (Meta)data includes qualified references to other metadata
Reusable
– (Meta)data must have a plurality of accurate and relevant attributes
– (Meta)data is released with a clear and accessible data usage license
– (Meta)data is associated with its provenance
– (Meta)data must meet domain-relevant community standards
To be Findable:
• F1. (meta)data are assigned a globally unique and
persistent identifier
– usage of URIs
• F2. data are described with rich metadata
– requires structured representation
– relates to linking to other data for context
• F3. metadata clearly and explicitly include the
identifier of the data it describes
– corresponds to the usage of URIs
• F4. (meta)data are registered or indexed in a
searchable resource
– concerns technologies that make use of the data, which is a
consequence of data access over the Web
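As a rough illustration of F1 to F4, the sketch below checks a metadata record for the four findability aspects. Everything in it is an assumption made for the example: the field names (`identifier`, `describes`, `indexed_in`), the `example.org` URLs, and the helper `findability_report` are hypothetical and not prescribed by FAIR. Persistence of an identifier cannot be verified syntactically, so F1 only tests that the identifier is an HTTP(S) URI.

```python
from urllib.parse import urlparse

# Hypothetical metadata record; field names are illustrative, not a FAIR schema.
record = {
    "identifier": "https://example.org/metadata/egfr-interactions",
    "describes": "https://example.org/dataset/egfr-interactions",
    "title": "EGFR interaction data",
    "keywords": ["EGFR", "protein interaction"],
    "indexed_in": ["https://example.org/catalog"],
}


def is_http_uri(value):
    """True if the value parses as an http(s) URI with a host part."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def findability_report(rec):
    """Map each Findable principle to a simple, purely syntactic check."""
    return {
        "F1": is_http_uri(rec.get("identifier", "")),      # URI as identifier
        "F2": bool(rec.get("title")) and bool(rec.get("keywords")),  # some metadata
        "F3": is_http_uri(rec.get("describes", "")),       # points at the data
        "F4": bool(rec.get("indexed_in")),                 # registered somewhere
    }


report = findability_report(record)
print(report)
```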
To be Accessible:
• A1. (meta)data are retrievable by their identifier using a
standardised communications protocol
– usage of URIs, although other identifiers could be used as well
• A1.1 the protocol is open, free, and universally
implementable
– URIs and related open data standards
• A1.2 the protocol allows for an authentication and
authorisation procedure, where necessary
– LOD covers only the open-access part, which is its primary
requirement
• A2. metadata are accessible, even when the data are no
longer available
– at best, this corresponds to the URIs,
– but long-term availability is not a matter of data standards; it
depends on IT infrastructure.
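A1 and A1.2 can be pictured as an ordinary HTTP request on the identifier, with an optional credential. The sketch below assumes HTTP as the standardised protocol, `text/turtle` as the requested representation, and a bearer token as the authorisation mechanism; none of these choices comes from the FAIR principles themselves. The URL is a placeholder and `build_metadata_request` is a hypothetical helper. The request is only constructed, not sent.

```python
from urllib.request import Request


def build_metadata_request(identifier, token=None):
    """Build (but do not send) a GET request that dereferences an identifier."""
    headers = {"Accept": "text/turtle"}  # ask for an RDF serialisation
    if token is not None:
        # A1.2: authentication and authorisation only where necessary.
        headers["Authorization"] = f"Bearer {token}"
    return Request(identifier, headers=headers, method="GET")


req = build_metadata_request("https://example.org/dataset/egfr-interactions")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the resulting object to `urllib.request.urlopen` would perform the retrieval. For open data no token is passed, which is the only case the LOD scheme covers.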
To be Interoperable:
• I1. (meta)data use a formal, accessible, shared, and
broadly applicable language for knowledge
representation
– use of “non-proprietary open formats”
• I2. (meta)data use vocabularies that follow FAIR
principles
– refers to contextual standardised data
• I3. (meta)data include qualified references to other
(meta)data
– refers to contextual standardised data
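To make I1 and I3 concrete, the following sketch writes a few statements in N-Triples, a non-proprietary open format, using the Dublin Core `title` property and `owl:sameAs` as a qualified reference to another resource. The `example.org` URIs are placeholders, and `term` and `to_ntriples` are hypothetical helpers written for this example; in practice an RDF library would handle serialisation, and the escaping here covers only backslashes, quotes and newlines.

```python
DCT_TITLE = "http://purl.org/dc/terms/title"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def term(value, is_uri):
    """Render an object as an N-Triples IRI or a plain string literal."""
    if is_uri:
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialise (subject, predicate, object, object_is_uri) tuples, one per line."""
    return "\n".join(
        f"<{s}> <{p}> {term(o, o_is_uri)} ." for s, p, o, o_is_uri in triples
    )


triples = [
    ("https://example.org/ncbi/EGFR", DCT_TITLE,
     "epidermal growth factor receptor", False),
    ("https://example.org/ncbi/EGFR", OWL_SAME_AS,
     "https://example.org/reactome/EGFR", True),
]
print(to_ntriples(triples))
```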
To be Reusable:
• R1. (meta)data are richly described with a plurality of
accurate and relevant attributes
– this refers to the structured data, but makes a stronger reference
to contextual information.
• R1.1. (meta)data are released with a clear and
accessible data usage license
– a statement about the delivery of (meta)data, which is meant to be
open according to LOD.
• R1.2. (meta)data are associated with detailed
provenance
– provenance is part of the contextual data.
• R1.3. (meta)data meet domain-relevant community
standards
– community standards are part of the contextual data and should be
openly available.
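The reusability sub-principles lend themselves to a simple gap list. The sketch below is only an illustration: the `ReuseMetadata` class, its field names, and the `reuse_gaps` helper are hypothetical, and the presence of a field says nothing about its quality. The Creative Commons URL is used purely as an example of a clear, accessible license statement.

```python
from dataclasses import dataclass, field


@dataclass
class ReuseMetadata:
    """Hypothetical container for the metadata that R1.1 to R1.3 ask for."""
    license_url: str = ""
    provenance: list = field(default_factory=list)
    community_standards: list = field(default_factory=list)


def reuse_gaps(meta):
    """Return the reusability sub-principles for which nothing is recorded."""
    gaps = []
    if not meta.license_url:
        gaps.append("R1.1 license")
    if not meta.provenance:
        gaps.append("R1.2 provenance")
    if not meta.community_standards:
        gaps.append("R1.3 community standards")
    return gaps


meta = ReuseMetadata(
    license_url="https://creativecommons.org/licenses/by/4.0/",
    provenance=["derived from a 2014 Bio2RDF release"],
)
print(reuse_gaps(meta))
```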
FAIR principles mapping to LOD 5-star
scheme
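The mapping figure for this slide did not survive extraction. As a stand-in, the sketch below encodes the cumulative 5-star scheme described at 5stardata.info and a partial FAIR-to-star mapping. The mapping is an assumption: it is one reading of the annotations on the preceding slides (URIs, structured representation, non-proprietary open formats, linkage for context, open license), not a reconstruction of the lost figure. `star_rating` and `FAIR_TO_STAR` are hypothetical names.

```python
# The five cumulative criteria of the 5-star open data scheme, in order.
STAR_CRITERIA = [
    "on the Web under an open license",   # 1 star
    "structured data",                    # 2 stars
    "non-proprietary open format",        # 3 stars
    "URIs denote things",                 # 4 stars
    "linked to other data",               # 5 stars
]


def star_rating(satisfied):
    """Count stars cumulatively: a level counts only if all lower levels are met."""
    stars = 0
    for criterion in STAR_CRITERIA:
        if criterion not in satisfied:
            break
        stars += 1
    return stars


# Assumed partial mapping from FAIR principles to the star level they echo.
FAIR_TO_STAR = {"R1.1": 1, "F2": 2, "I1": 3, "F1": 4, "F3": 4, "A1": 4, "I3": 5}

print(star_rating({"on the Web under an open license", "structured data"}))
```

Because the scheme is cumulative, a dataset that uses URIs but ships in a proprietary format still stops at two stars. FAIR has no such ordering, which is one of the differences the talk discusses.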
Distinctions and Overlaps
• LOD and non-data assets
– The LOD 5-star scheme applies to open data.
– FAIR data need not be openly available, but FAIR requires that access to
the license agreement is made available.
– FAIR asks for meta-data to be provided with the data to improve
interoperability; this corresponds to the contextual information that the
LOD principles obtain by linking data to one or several
interoperable open data sources.
• Tool and technology independent
– FAIR Guiding Principles are independent of implementation choices,
and do not suggest any specific technology, standard, or
implementation solution.
– FAIR principles are not, themselves, a standard or a specification but
a guide to data publishers and consumers to assist regarding
Findable, Accessible, Interoperable, and Reusable implementations
and resources.
– The LOD 5-star scheme is likewise not itself a standard or a specification
but provides a guideline for choosing the tools and technologies to …
Conclusions
• The FAIR principles focus on user needs around a wide
range of existing data, whereas the LOD principles form an
idealistic approach to a world using open data.
• The LOD principles appear to be the more stringent of the
two, whereas the FAIR data principles can be seen
as reusing the LOD principles without explicit
reference, adding the separation of data into
core data and meta-data as well as license
considerations.
• The future will tell whether the FAIR data principles lead to
the reusability of data, on the grounds that
access to meta-data and to license agreements opens the
avenues of reusability, or whether the LOD principles are
the gatekeepers to reusability, since, a priori, the data
should be openly available anyway.
Thanks for your attention…


Editor's Notes

  • #5 We live in a world of data!
  • #12 EGFR: Epidermal growth factor receptor
  • #22 Thank you for your attention and I open the floor for any questions and comments.