Making Data FAIR*
Tom Plasterer, PhD
Director, Bioinformatics, Research Bioinformatics
20 Mar 2019
* Findable, Accessible, Interoperable and Reusable
3
What FAIR: Principles at-a-Glance
Findable:
• F1 (meta)data are assigned a globally
unique and persistent identifier
• F2 data are described with rich metadata
• F3 metadata clearly and explicitly include
the identifier of the data they describe
• F4 (meta)data are registered or indexed in a
searchable resource
The FAIR Guiding Principles for scientific data management and stewardship
Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016)
Accessible:
• A1 (meta)data are retrievable by their identifier
using a standardized communications protocol
• A1.1 the protocol is open, free, and universally
implementable
• A1.2 the protocol allows for an authentication and
authorization procedure, where necessary
• A2 metadata are accessible, even when the data
are no longer available
Interoperable:
• I1 (meta)data use a formal, accessible,
shared, and broadly applicable language for
knowledge representation
• I2 (meta)data use vocabularies that follow
FAIR principles
• I3 (meta)data include qualified references to
other (meta)data
Reusable:
• R1 (meta)data are richly described with a plurality
of accurate and relevant attributes
• R1.1 (meta)data are released with a clear and
accessible data usage license
• R1.2 (meta)data are associated with detailed
provenance
• R1.3 (meta)data meet domain-relevant
community standards
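A minimal sketch of how a few of these principles might be checked against a metadata record. The field names (`identifier`, `describes`, `license`, `provenance`) and the example URLs are assumptions made for illustration; the FAIR principles themselves do not prescribe a schema.

```python
from typing import Dict, List

# Hypothetical field names mapped to the principle each one illustrates.
# The FAIR principles do not mandate these names.
REQUIRED_FIELDS = {
    "identifier": "F1: globally unique, persistent identifier",
    "describes": "F3: identifier of the data being described",
    "license": "R1.1: clear data usage license",
    "provenance": "R1.2: detailed provenance",
}


def missing_fair_fields(record: Dict[str, str]) -> List[str]:
    """Return the principles whose illustrative field is absent or empty."""
    return [
        label
        for field, label in REQUIRED_FIELDS.items()
        if not record.get(field)
    ]


# Made-up record: it has no provenance entry.
record = {
    "identifier": "https://example.org/dataset/123",
    "describes": "https://example.org/data/123.csv",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

print(missing_fair_fields(record))
```

A check like this covers only the presence of fields; whether an identifier is truly persistent or a license truly clear still needs human or community review.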
4
Collaborative & Competitive Intelligence:
• Who do we want to partner with? Are there complementary assets to our portfolio?
• What space is too crowded and not our area of expertise?
• Greenfield situations?
Mergers, Acquisitions, Partnerships:
• How do we efficiently and deeply absorb data generated elsewhere into our systems? How
do we efficiently share?
• Does this make a smaller biotech/start-up a more viable partner?
Improved Patient Care:
• Can we share data and outcomes more efficiently in complicated trial settings (basket trials,
adaptive trials) to better engage opinion leaders and foster dialog?
• Along with Differential Privacy approaches, can we have the broader research community
help mine our data?
• How do we best reuse Real World Evidence (RWE) data in the clinic and in trial design?
Data (Ir)-reproducibility:
• Can we make preclinical data (more) reproducible?
• Can we utilize data credentialization? (thanks to Dan Crowther @ Exscientia)
Why FAIR: Biopharma Value Proposition
5
Why FAIR: €26bn Reasons…
6
When FAIR: A Brief History
Moving away from Narrative
• Nanopublications
Incubating Standards in Open PHACTS
• VoID, PROV-O
Lorentz Center Workshop
• FORCE 11 FAIR Guiding Principles
• Participants: IMI members, US researchers,
Content providers, ELIXIR; European Open
Science Cloud, Big Data to Knowledge (BD2K)
Current Status:
• FAIR Data Workshops (EU-ELIXIR nodes)
• Inclusion in Horizon 2020, NIH Advocacy
• IMI2 Data FAIR-ification Call
• Vendors getting up to speed
7
Linked Data Community of Practice
How familiar are you with the
FAIR principles and metrics?
When FAIR: Community Awareness
8
Linked Data Community of Practice
What is the maturity
level of your
organization with
respect to
implementation of
FAIR?
When FAIR: Getting Started
9
How FAIR: Pistoia FAIR Implementation Group
• Business challenge:
- Effective application and analysis of data
assets in the life science industry demands that
they are made Findable, Accessible,
Interoperable and Reusable
• Update and plans:
- Workshop at The Hyve, Utrecht NL in June
2018 resulted in a published feature
article:-
- Workshop at EPAM, Boston US in Dec
2018 contributed to the business case
thinking
- Phase 1 for 2019 plans:-
• Develop the business case to define
distinctive role for the project
• Develop the FAIR Toolkit concept
• Select a use case: e.g. clinical science
to engage with CROs at a workshop
- Seeking more funding – join us!
PM: Ian Harrow
Collaborators
1. Metric Tools & Best Practice
2. Training resources
3. Culture change process
4. Use case examples
5. Cost benefit examples
• Adapt for Life Science industry
• Leverage existing FAIR resources
FAIR Toolkit
Implementation
for LS Industry
FAIR
10
How FAIR: Pistoia Ontologies Mapping Project
• Business challenge:
– Use of different ontologies within
same data domain hampers
interoperability and application.
Solve by mapping between them.
• Update and plans:
– Phase 3 completed by end of 2018
• Predicted mappings delivered as a
prototype Ontology Mapping Service
for phenotype and disease domain
• Mappings will be available through
public wiki and OxO mapping repository
at EMBL-EBI
• Mapping algorithm, Paxo, is openly
available on GitHub
– Phase 4 for 2019 plans:-
• To extend mapping of biological and
chemical ontologies for support of
laboratory analytics
• FAIR implementation is planned
– Seeking more funding – join us!
Partners
PM: Ian Harrow
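To illustrate the idea of mapping between ontologies in the same domain, here is a small sketch of a lookup over predicted mappings with a confidence cut-off. It is not the Paxo algorithm or the OxO API; the term identifiers, scores, and the `translate` function are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical predicted mappings: source term -> (target term, confidence).
# Identifiers and scores are invented for illustration only.
MAPPINGS: Dict[str, Tuple[str, float]] = {
    "ONTO_A:0001": ("ONTO_B:9001", 0.95),
    "ONTO_A:0002": ("ONTO_B:9002", 0.60),
}


def translate(term: str, min_score: float = 0.8) -> Optional[str]:
    """Return the mapped term if a mapping exists and is confident enough."""
    hit = MAPPINGS.get(term)
    if hit is None:
        return None
    target, score = hit
    return target if score >= min_score else None


print(translate("ONTO_A:0001"))
print(translate("ONTO_A:0002"))
```

Keeping a score with each predicted mapping lets a consumer choose how strict to be; low-confidence mappings can be routed to a curator instead of being applied automatically.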
11
How FAIR:
12
How FAIR: Implementation Networks
13
How FAIR:
Overview:
• ELIXIR - Project Coordinator & Janssen - Project Leader
• 22 participants with 12 academic, 7 EFPIA, 3 SME
• €8.23M budget with €4M H2020 EC funding + €4.23M EFPIA in-kind
• 42 months
Goals:
• Establish a value-based process for prioritization and selection of IMI project databases
• Develop FAIRification toolkit e.g. develop guidelines, tools and metrics - FAIR Cookbook
• Apply this toolkit to FAIRify datasets from selected IMI projects and EFPIA companies
• Deliver training for data handlers (academia, SMEs and pharmaceuticals) to change and
sustain the data management culture
• Foster an innovation ecosystem on FAIR open data to power future reuse, knowledge
generation and societal benefit e.g. FAIR innovation and SME events
Members:
PM: Serena Scollen
14
How FAIR: Concept
15
How FAIR: FAIR Metrics &
17
Start FAIR: Find me Datasets about:
Projects
Study
Indication/
Disease
Technology
Targets
Cohort
Dates
Agent
Therapeutic
Area
Drugs
18
Dataset Catalog is a collection of Dataset Records
• Catalogs are needed to support FAIR (Findable) data
• Catalogs can and should support Enterprise MDM strategies
• Consumers can be internal or external
Dataset Catalogs are needed so data consumers can find Datasets
• Dataset records need sufficient metadata to support discoverability
• Dataset terms are NOT the data instance
Dataset Catalogs surface dataset provenance and enable data access
Dataset Catalogs can provide datasets for multiple consumption patterns
• Analytics readiness and fit
• ‘Walking’ across information models
Start FAIR: Findability Starts with Catalogs
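The catalog idea can be sketched as a list of dataset records that carry descriptive metadata, not the data itself, and a search over facets such as therapeutic area or technology. The `DatasetRecord` class, the facet names, and the example records are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetRecord:
    """Describes a dataset; it does not contain the data instance itself."""
    identifier: str
    title: str
    facets: Dict[str, str] = field(default_factory=dict)
    access_url: str = ""  # where a consumer would go to request the data


def find_datasets(catalog: List[DatasetRecord], **criteria: str) -> List[DatasetRecord]:
    """Return records whose facets match every given criterion exactly."""
    return [
        record
        for record in catalog
        if all(record.facets.get(key) == value for key, value in criteria.items())
    ]


# Made-up catalog entries.
catalog = [
    DatasetRecord(
        "ds-001",
        "Example oncology study",
        {"therapeutic_area": "oncology", "technology": "RNA-seq"},
        "https://example.org/ds-001",
    ),
    DatasetRecord(
        "ds-002",
        "Example respiratory cohort",
        {"therapeutic_area": "respiratory", "technology": "proteomics"},
    ),
]

hits = find_datasets(catalog, therapeutic_area="oncology")
print([record.identifier for record in hits])
```

Exact string matching is the simplest possible choice here; tagging records with concepts from a shared taxonomy would make the same search robust to synonyms.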
19
Start FAIR: A DCAT conformant Data Catalog
https://www.w3.org/TR/hcls-dataset/
https://www.w3.org/TR/vocab-dcat/#vocabulary-overview
Semantic tagging of datasets with
concepts from taxonomies:
• provides context
• multi-dimensional & flexible
• effective for discoverability
• light-weight semantics
skos:Concept
dcat:Catalog skos:ConceptScheme
dctypes:Dataset (summary)
dct:title
dct:publisher <foaf:Agent>
foaf:page
void:sparqlEndpoint
dct:accrualPeriodicity
dcat:keyword
dcat:dataset
dcat:theme
dctypes:Dataset (version)
dcat:Distribution
(dctypes:Dataset)
void:vocabulary
dct:conformsTo
void:exampleResource
…other void properties
dcat:distribution
dcat:themeTaxonomy
dct:isVersionOf
pav:previousVersion
dct:hasPart
pav:hasCurrentVersion
dct:hasPart
dct:title
dct:publisher <foaf:Agent>
pav:version
dct:creator <foaf:Agent>
dct:created
dct:source
dct:creator <foaf:Agent>
dct:license
dct:format
pav:retrievedFrom
dct:created
pav:createdWith
dcat:accessURL
dcat:downloadURL
void:Dataset
dct:title
dct:description
dct:publisher <foaf:Agent>
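A rough sketch of what a catalog entry using a few of these terms could look like when serialized as Turtle. It uses only string formatting from the standard library and `dcat:Dataset` rather than the `dctypes:Dataset` levels of the HCLS profile, so it is a simplification. The function name, the IRIs, and the values are hypothetical, and the sketch assumes at least one keyword and no quote characters in the strings; an RDF library would handle escaping properly.

```python
from typing import List

PREFIXES = (
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
    "@prefix dct: <http://purl.org/dc/terms/> .\n"
)


def dataset_to_turtle(
    catalog_iri: str,
    dataset_iri: str,
    title: str,
    keywords: List[str],
    access_url: str,
) -> str:
    """Build a small Turtle document: one catalog, one dataset, one distribution.

    Assumes `keywords` is non-empty and that no value contains a double quote.
    """
    keyword_list = ", ".join(f'"{keyword}"' for keyword in keywords)
    return (
        f"{PREFIXES}\n"
        f"<{catalog_iri}> a dcat:Catalog ;\n"
        f"    dcat:dataset <{dataset_iri}> .\n\n"
        f"<{dataset_iri}> a dcat:Dataset ;\n"
        f'    dct:title "{title}" ;\n'
        f"    dcat:keyword {keyword_list} ;\n"
        f"    dcat:distribution [ a dcat:Distribution ;\n"
        f"        dcat:accessURL <{access_url}> ] .\n"
    )


# Made-up IRIs and values.
ttl = dataset_to_turtle(
    "https://example.org/catalog",
    "https://example.org/dataset/123",
    "Example oncology study",
    ["oncology", "RNA-seq"],
    "https://example.org/dataset/123/access",
)
print(ttl)
```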
Start FAIR: Dataset to Knowledge Graph to Analytics
Data Catalog Filter
Phase 1
Experiment Metadata Filter
Phase 2
Ad hoc Analyses Filtering
Phase 3
Outbound
to Data Analytics
Data Science
Tools
Statistical
Filtering
e.g., clinical trial with > 50
participants
Dataset
Catalog
Descriptions
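The three filtering phases can be sketched as successive narrowing steps over candidate datasets. The `Trial` class, its fields, and the sample values are invented for illustration, and placing the "more than 50 participants" example in phase 2 is an assumption, since the slide does not say which phase it belongs to.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trial:
    name: str
    therapeutic_area: str
    participants: int
    has_biomarker_data: bool


def phase1_catalog(trials: List[Trial], area: str) -> List[Trial]:
    """Phase 1: filter on catalog-level descriptions."""
    return [t for t in trials if t.therapeutic_area == area]


def phase2_metadata(trials: List[Trial], min_participants: int = 50) -> List[Trial]:
    """Phase 2: filter on experiment metadata (strictly more than the minimum)."""
    return [t for t in trials if t.participants > min_participants]


def phase3_adhoc(trials: List[Trial], predicate: Callable[[Trial], bool]) -> List[Trial]:
    """Phase 3: apply any ad hoc, analysis-specific condition."""
    return [t for t in trials if predicate(t)]


# Made-up trials.
trials = [
    Trial("T1", "oncology", 120, True),
    Trial("T2", "oncology", 30, True),
    Trial("T3", "oncology", 80, False),
    Trial("T4", "respiratory", 200, True),
]

selected = phase3_adhoc(
    phase2_metadata(phase1_catalog(trials, "oncology")),
    lambda t: t.has_biomarker_data,
)
print([t.name for t in selected])
```

Ordering the phases from cheapest (catalog descriptions) to most specific (ad hoc conditions) means only a small set of datasets needs to be opened before anything is handed to analytics tools.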
R&D | RDI
Why FAIR?
• Cost avoidance, Business Advantage, Data Stewardship
When FAIR?
• Now! Peers, especially in Europe, are doing it
How FAIR?
• FAIRplus, GO-FAIR, Pistoia FAIR Implementation Group
Start FAIR
• Findability first, adopt a FAIR-compliant Data Catalog
FAIR-for-Biopharma: Take-aways
R&D | RDI
Thanks
Key Influencers
David Wood
Tim Berners-Lee
Lee Harland
Jane Lomax
James Malone
Dean Allemang
Barend Mons
Carole Goble
Bernadette Hyland
Bob Stanley
Eric Little
Michel Dumontier
John Wilbanks
Hans Constandt
Filip Pattyn
Tim Hoctor
Kees Van Boche
Serena Scollen
AstraZeneca/Pistoia FAIR
Data Community
Mathew Woodwark
Rajan Desai
Nic Sinibaldi
Chia-Chien Chiang
Kerstin Forsberg
Ola Engkvist
Ian Dix
Colin Wood
Ted Slater
Martin Romacker
Eric Neumann
John Wise
Carmen Nitsche
Ian Harrow
Jeff Saltzman
Kathy Reinold

Editor's Notes

  • #4 Eric Schulte’s talk: Ready, Set, GO-FAIR: https://vimeo.com/282650465
  • #5 50% (or more) of preclinical research could not be reproduced, at a cost of $28B/year: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002165 Pistoia paper: Implementation and relevance of FAIR data principles in biopharmaceutical R&D: https://www.ncbi.nlm.nih.gov/pubmed/30690198
  • #6 https://dx.doi.org/10.2777/02999 https://publications.europa.eu/en/publication-detail/-/publication/d375368c-1a0a-11e9-8d04-01aa75ed71a1/language-en
  • #7 Horizon 2020 is the biggest EU Research and Innovation programme ever, with nearly €80 billion of funding available over 7 years (2014 to 2020)
  • #16 http://fairmetrics.org/ https://fairshake.cloud/?q=TCGA
  • #18 Images: http://senior-project-led-cube.wikispaces.com/ (https://creativecommons.org/licenses/by-sa/3.0/) http://opensource.org/node/688 (https://creativecommons.org/licenses/by/4.0/)