FAIR Data Knowledge Graphs

FAIR* Data Knowledge Graphs
Tom Plasterer, PhD
Director, Bioinformatics, Research Bioinformatics
11 Mar 2019
* Findable, Accessible, Interoperable and Reusable

What do R&D Researchers want the ability to do?
• Gain a greater understanding of the biology of the molecular mechanisms of diseases
• Use the human as a model organism to a greater degree
• Discover how the microbiome is involved with human pathogenesis
• Understand molecular mechanisms of drug failures
• Use patient-level clinical data to identify subphenotypes of diseases
Integrative Informatics: A hybrid approach to
integrating data for Drug Discovery
@Mathew Woodwark;
Pharma 2020: March 28, 2018

Can R&D researchers do these things today?
• Currently, data exists in file shares, on laptops, eLN, in silos of managed systems and unknown places
• The level of data integration is immature and fragmented
• Using systems biology approaches requires considerable time and effort
• Bioinformatics groups become a bottleneck to analyzing data
• Research scientists are not empowered to use information and knowledge to answer complex questions

IIx Approach: Build a FAIR Data Knowledge Graph

FAIR Principles: One-Slide Overview
Findable:
• F1 (meta)data are assigned a globally unique and persistent identifier
• F2 data are described with rich metadata
• F3 metadata clearly and explicitly include the identifier of the data it describes
• F4 (meta)data are registered or indexed in a searchable resource
Accessible:
• A1 (meta)data are retrievable by their identifier using a standardized communications protocol
• A1.1 the protocol is open, free, and universally implementable
• A1.2 the protocol allows for an authentication and authorization procedure, where necessary
• A2 metadata are accessible, even when the data are no longer available
Interoperable:
• I1 (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
• I2 (meta)data use vocabularies that follow FAIR principles
• I3 (meta)data include qualified references to other (meta)data
Reusable:
• R1 (meta)data are richly described with a plurality of accurate and relevant attributes
• R1.1 (meta)data are released with a clear and accessible data usage license
• R1.2 (meta)data are associated with detailed provenance
• R1.3 (meta)data meet domain-relevant community standards
The FAIR Guiding Principles for scientific data management and stewardship
Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016)

Knowledge Graph: Definition(s)…

Knowledge Graph: Innovation Trigger
Gartner Identifies Five Emerging Technology Trends That Will Blur the Lines Between Human and Machine

Knowledge Graph: Key Features and Differentiators
Federation:
• Leave Data in place or ETL pipeline?
• URIs, indices really important
Standards Support (Syntactic and Semantic)
• Universal structure or bespoke?
• Universal query language or bespoke?
Analytics Enablement
• Reasoning, inferencing, graph methodologies
Hybrid
• Underlying data in multiple shapes and repositories
For Machines (and occasionally people)
Cypher

Starting Point: Modeling Business Questions
Classes: core:Study, core:Project, core:Target, core:Subject, core:Drug, core:Indication, core:TherapeuticArea, core:BiologicalSample, core:Measurement, core:Technology, core:Visit, bdm:Cohort
Properties: core:hasSubject, core:hasProject, core:hasDrug, core:hasIndication, bdm:hasArm, bdm:participatesIn, core:hasTA, core:hasTarget, core:hasMeasurement, core:hasSample, core:hasVisit, core:measuredBy, bnav:measuredInStudy
Example business questions:
• Find all subjects diagnosed with SLE with a disease activity score > 5
• Find all studies evaluating the target PD-L1 with RNA Seq Datasets
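As a rough illustration of how the second business question maps onto the domain model, the sketch below walks a toy set of triples in plain Python. The instance identifiers (`study:S1`, `meas:M1`, `tech:RNASeq`, and so on) are invented, and the direction of each property (a study has a target and measurements; a measurement is measured by a technology) is an assumption, since the slide shows only the class and property names.

```python
# Toy triple store: a set of (subject, predicate, object) tuples.
# All instances below are hypothetical; only the core: names come from the slide.
TRIPLES = {
    ("study:S1", "rdf:type", "core:Study"),
    ("study:S1", "core:hasTarget", "target:PD-L1"),
    ("study:S1", "core:hasMeasurement", "meas:M1"),
    ("meas:M1", "core:measuredBy", "tech:RNASeq"),
    ("study:S2", "rdf:type", "core:Study"),
    ("study:S2", "core:hasTarget", "target:PD-L1"),
    ("study:S2", "core:hasMeasurement", "meas:M2"),
    ("meas:M2", "core:measuredBy", "tech:FlowCytometry"),
    ("study:S3", "rdf:type", "core:Study"),
    ("study:S3", "core:hasTarget", "target:EGFR"),
    ("study:S3", "core:hasMeasurement", "meas:M3"),
    ("meas:M3", "core:measuredBy", "tech:RNASeq"),
}


def objects(subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def instances_of(rdf_class):
    """Return every subject typed as `rdf_class`."""
    return {s for s, p, o in TRIPLES if p == "rdf:type" and o == rdf_class}


def studies_for_target_with_technology(target, technology):
    """Studies that evaluate `target` and have a measurement made with `technology`."""
    hits = []
    for study in sorted(instances_of("core:Study")):
        if target not in objects(study, "core:hasTarget"):
            continue
        measurements = objects(study, "core:hasMeasurement")
        if any(technology in objects(m, "core:measuredBy") for m in measurements):
            hits.append(study)
    return hits


RESULT = studies_for_target_with_technology("target:PD-L1", "tech:RNASeq")
print(RESULT)
```

In a real knowledge graph the same pattern would more likely be a graph query than a Python loop; the point here is only that the question decomposes into a chain of class and property hops.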

Enrichment: Core Ontology Classes & API mapping
Challenge is determining the “stickiest” representation for a given instance
• Studies all have a ‘D’-code and then a number of other internal and external identifiers
• API calls to an internal clinical study API and an external (licensed content) API to obtain the exact matches (skos:exactMatch)
• Process is abstracted in an Enrichment Service
• New relationships (triples) are added to the wrapped data model and pushed into a knowledge graph
Example (core:Study):
http://data.rd.astrazeneca.net/study/bdm/CP1103
skos:exactMatch:
  http://clinicaltrials.astrazeneca.net/study/D4660C00001
  http://identifiers.org/clinicaltrials/NCT01448850
  http://trialtrove.citeline.com/ClinicalTrial/154466
dct:identifier:
  "azct:D4660C00001"
  "ctg:NCT01448850"
  "trialtrove:154466"
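The enrichment step described on this slide can be sketched as a function that turns lookup results into new triples. This is a minimal sketch, not the actual Enrichment Service: `stub_lookup` is a hypothetical stand-in for the internal and licensed-content API calls, and attaching each `dct:identifier` to the matched resource (rather than to the study itself) is an assumption about how the diagram should be read. The URIs and identifiers are the ones shown on the slide.

```python
# Triples are plain (subject, predicate, object) tuples.
STUDY = "http://data.rd.astrazeneca.net/study/bdm/CP1103"


def stub_lookup(study_uri):
    """Hypothetical stand-in for the clinical study and licensed-content APIs.

    Returns (matched URI, compact identifier) pairs copied from the slide.
    """
    known = {
        STUDY: [
            ("http://clinicaltrials.astrazeneca.net/study/D4660C00001", "azct:D4660C00001"),
            ("http://identifiers.org/clinicaltrials/NCT01448850", "ctg:NCT01448850"),
            ("http://trialtrove.citeline.com/ClinicalTrial/154466", "trialtrove:154466"),
        ]
    }
    return known.get(study_uri, [])


def enrich_study(study_uri, lookup):
    """Build the new triples that link a study to its exact matches."""
    triples = []
    for match_uri, identifier in lookup(study_uri):
        triples.append((study_uri, "skos:exactMatch", match_uri))
        # Assumption: the compact identifier describes the matched resource.
        triples.append((match_uri, "dct:identifier", identifier))
    return triples


NEW_TRIPLES = enrich_study(STUDY, stub_lookup)
for triple in NEW_TRIPLES:
    print(triple)
```

Passing the lookup in as a parameter mirrors the idea that the process is abstracted: the caller does not need to know which API supplied a given match.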

Enrichment: Core Ontology Classes & Label Matching
Now find “stickiest” representation for a given instance from a label
• Use system label for the indication
• Send to Enrichment API (augmented public disease vocabularies) and generate the preferred URI to obtain the close matches (skos:closeMatch)
• Process is abstracted in an Enrichment Service
• New relationships (triples) are added to the wrapped data model and pushed into a knowledge graph
Example (core:Indication):
http://data.rd.astrazeneca.net/indication/bdm/Rheumatoid%20Arthritis
skos:closeMatch:
  http://purl.obolibrary.org/obo/DOID_7148
  http://identifiers.org/mesh/D001172
bnav:diseaseNameSymbol: "Rheumatoid Arthritis (D001172) "
skos:prefLabel: "Rheumatoid Arthritis"
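Label-based enrichment can be sketched in a similar way: clean the system label, mint the internal URI from it, and look the label up in a disease vocabulary to obtain close matches. The `VOCABULARY` dictionary below is a hypothetical one-entry stand-in for the augmented public disease vocabularies, and the matching rule (whitespace cleanup plus case-insensitive comparison) is an assumption; the URIs are the ones shown on the slide.

```python
from urllib.parse import quote

BASE = "http://data.rd.astrazeneca.net/indication/bdm/"

# Hypothetical stand-in for the augmented public disease vocabularies.
VOCABULARY = {
    "rheumatoid arthritis": [
        "http://purl.obolibrary.org/obo/DOID_7148",
        "http://identifiers.org/mesh/D001172",
    ],
}


def clean(label):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(label.split())


def enrich_indication(label):
    """Return triples for the indication: its label plus any close matches."""
    pref_label = clean(label)
    subject = BASE + quote(pref_label)  # a space becomes %20
    triples = [(subject, "skos:prefLabel", pref_label)]
    for concept_uri in VOCABULARY.get(pref_label.casefold(), []):
        triples.append((subject, "skos:closeMatch", concept_uri))
    return triples


TRIPLES = enrich_indication("Rheumatoid  Arthritis ")
for triple in TRIPLES:
    print(triple)
```

A label with no vocabulary entry still yields its `skos:prefLabel` triple, so unmatched indications remain visible in the graph and can be curated later.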

Enrichment: Core Ontology Classes & Mixed Vocabs
Now find “stickiest” representation for a given instance from a label without a good vocabulary
• Aligned internal Technology vocabulary with best public label and URI
• Send to Enrichment API (augmented BDM-technology vocabulary) and generate the preferred URI to obtain the exact matches (skos:exactMatch)
• Process is abstracted in an Enrichment Service
• New relationships (triples) are added to the wrapped data model and pushed into a knowledge graph
Example (core:Technology):
http://data.rd.astrazeneca.net/technology/bdm/BDMTECH00005
  skos:prefLabel: "Blood Gas"
  skos:exactMatch: http://identifiers.org/ncit/C71252
http://identifiers.org/ncit/C71252
  skos:prefLabel: "Arterial Blood Gas Measurement"

Key Lesson: Where is Enrichment Critical?
Classes: core:Study, core:Project, core:Target, core:Subject, core:Drug, core:Indication, core:TherapeuticArea, core:BiologicalSample, core:Measurement, core:Technology, core:Visit, bdm:Cohort
Properties: core:hasSubject, core:hasProject, core:hasDrug, core:hasIndication, bdm:hasArm, bdm:participatesIn, core:hasTA, core:hasTarget, core:hasMeasurement, core:hasSample, core:hasVisit, core:measuredBy
Legend (vocabulary source per class): External, Internal, Mix

Dataset Catalogs: Find me Datasets about:
• Projects
• Study
• Indication/Disease
• Technology
• Targets
• Cohort
• Dates
• Agent
• Therapeutic Area
• Drugs

Dataset Catalogs: Findability Starts Here
A Dataset Catalog is a collection of Dataset Records
• Catalogs are needed to support FAIR (Findable) data
• Catalogs can and should support Enterprise MDM strategies
• Consumers can be internal or external
Dataset Catalogs are needed so data consumers can find Datasets
• Dataset records need sufficient metadata to support discoverability
• Dataset terms are NOT the data instance
Dataset Catalogs surface dataset provenance and enable data access
Dataset Catalogs can provide datasets for multiple consumption patterns
• Analytics readiness and fit
• ‘Walking’ across information models

The Backbone: A DCAT conformant Data Catalog
https://www.w3.org/TR/hcls-dataset/
https://www.w3.org/TR/vocab-dcat/#vocabulary-overview
Semantic tagging of datasets with
concepts from taxonomies:
• provides context
• multi-dimensional & flexible
• effective for discoverability
• light-weight semantics
skos:Concept
dcat:Catalog skos:ConceptScheme
dctypes:Dataset (summary)
dct:title
dct:publisher <foaf:Agent>
foaf:page
void:sparqlEndpoint
dct:accrualPeriodicity
dcat:keyword
dcat:dataset
dcat:theme
dctypes:Dataset (version)
dcat:Distribution
(dctypes:Dataset)
void:vocabulary
dct:conformsTo
void:exampleResource
…other void properties
dcat:distribution
dcat:themeTaxonomy
dct:isVersionOf
pav:previousVersion
dct:hasPart
pav:hasCurrentVersion
dct:hasPart
dct:title
pav:version
dct:creator <foaf:Agent>
dct:created
dct:source
dct:creator <foaf:Agent>
dct:license
dct:format
pav:retrievedFrom
dct:created
pav:createdWith
dcat:accessURL
dcat:downloadURL
void:Dataset
dct:title
dct:description
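To make the catalog vocabulary above concrete, the sketch below assembles a small summary-level dataset record and prints it as Turtle using only the standard library. The dataset URI, title, description and keywords are invented for illustration, and the tiny serializer is a simplification rather than a full Turtle writer; the property names and the MeSH theme URI are taken from the slides.

```python
# Namespace prefixes for the vocabularies used in the record.
PREFIXES = """@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
"""


def uri(value):
    """Render a URI reference in Turtle syntax."""
    return f"<{value}>"


def lit(text):
    """Render a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_turtle(subject, rdf_type, properties):
    """Serialize one resource; `properties` is a list of (predicate, values)."""
    lines = [f"{uri(subject)} a {rdf_type}"]
    for predicate, values in properties:
        lines.append(f"    {predicate} {', '.join(values)}")
    return PREFIXES + "\n" + " ;\n".join(lines) + " .\n"


# Hypothetical record: the URI, title, description and keywords are made up.
RECORD = [
    ("dct:title", [lit("Example RNA-Seq dataset")]),
    ("dct:description", [lit("Illustrative catalog record, not a real dataset.")]),
    ("dcat:keyword", [lit("RNA-Seq"), lit("rheumatoid arthritis")]),
    ("dcat:theme", [uri("http://identifiers.org/mesh/D001172")]),
]

TURTLE = to_turtle("http://example.org/dataset/ds1", "dctypes:Dataset", RECORD)
print(TURTLE)
```

The `dcat:theme` value is the semantic tag: it points at a taxonomy concept rather than repeating a free-text keyword, which is what lets a catalog filter by concept.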

Data Discoverability: Multi-phase Filtering
Phase 1: Data Catalog Filter
Phase 2: Experiment Metadata Filter
Phase 3: Ad hoc Analyses Filtering
Outbound to Data Analytics / Data Science Tools
Diagram annotations: Dataset Catalog Descriptions; e.g., clinical trial with > 50 participants; Statistical Filtering
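The three phases can be read as successive filters over a shrinking candidate set. The sketch below is one possible reading, under the assumption that phase 1 uses catalog descriptions, phase 2 uses experiment metadata (the slide's "> 50 participants" example), and phase 3 applies a statistical criterion to the instance data. The dataset records, field names and the score threshold are all invented for illustration.

```python
from statistics import mean

# Hypothetical records combining catalog, experiment and instance-level fields.
DATASETS = [
    {"id": "ds1", "indication": "Rheumatoid Arthritis", "participants": 120, "scores": [6.1, 5.4, 7.0]},
    {"id": "ds2", "indication": "Rheumatoid Arthritis", "participants": 30, "scores": [6.5, 6.8]},
    {"id": "ds3", "indication": "SLE", "participants": 200, "scores": [5.9, 6.2]},
    {"id": "ds4", "indication": "Rheumatoid Arthritis", "participants": 80, "scores": [2.1, 3.0, 2.5]},
]


def phase1_catalog(records, indication):
    """Phase 1: keep datasets whose catalog description matches the indication."""
    return [r for r in records if r["indication"] == indication]


def phase2_experiment(records, min_participants):
    """Phase 2: keep datasets with more than `min_participants` participants."""
    return [r for r in records if r["participants"] > min_participants]


def phase3_adhoc(records, min_mean_score):
    """Phase 3: keep datasets whose mean score exceeds the threshold."""
    return [r for r in records if mean(r["scores"]) > min_mean_score]


candidates = phase1_catalog(DATASETS, "Rheumatoid Arthritis")
candidates = phase2_experiment(candidates, 50)
candidates = phase3_adhoc(candidates, 5.0)

# The surviving datasets are what would be handed to analytics tools.
SELECTED = [r["id"] for r in candidates]
print(SELECTED)
```

Ordering the cheap catalog filter first means the more expensive instance-level check only runs on datasets that already passed the metadata criteria.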

FAIR Knowledge Graph: Take-aways
Multi-Phase Filtering joins the Catalog and Domain Model
• Balance what belongs in a catalog record vs. instance data
Public Domain Ontologies and Identifiers should be reused
• Consensus is emerging around best practices and cross-mapping
DCTERMS, DCAT, VoID are almost sufficient
• Extend for local needs
Lots of Activity to Learn and Shape Best Practices
• Didn’t reinvent the wheel

Thanks
Key Influencers
David Wood
Tim Berners-Lee
Lee Harland
Jane Lomax
James Malone
Dean Allemang
Barend Mons
Carole Goble
Bernadette Hyland
Bob Stanley
Eric Little
Michel Dumontier
John Wilbanks
Hans Constandt
Filip Pattyn
Dan Crowther
Tim Hoctor
Ian Harrow
AstraZeneca/Pistoia FAIR Data Community
Mathew Woodwark
Rajan Desai
Nic Sinibaldi
Chia-Chien Chiang
Kerstin Forsberg
Ola Engkvist
Ian Dix
Colin Wood
Ted Slater
Martin Romacker
Eric Neumann
Jeff Saltzman
Kathy Reinold
Nirmal Keshava
Bryan Takasaki
