Pistoia Alliance Webinar
FAIR by Design
14th May 2020
15.00 to 16.30 BST
This webinar is being recorded
Audience Q&A
Please use the questions box
Introduction
Ian Harrow, Project Manager
Pistoia Alliance
©PistoiaAlliance
Themes and Objectives
• To position FAIR as a key enabler to automate and accelerate
R&D process workflows
• FAIR Implementation by design within the context of a use case
• Grounded in precise outcomes
– e.g. faster and bigger science / more reuse of data to enhance value /
increased ability to share data for collaboration and partnership
• To make data actionable especially through FAIR interoperability
Bios
Mathew Woodwark
• Head of Data Infrastructure, Tools, Data
Science and AI at AstraZeneca
• Experienced Informatics and Information
Management professional, with an
established track record of delivery
• Combines biological understanding,
organisational psychology and technical
skills to manage a wide portfolio of
complex Informatics projects
Erik Schultes
• International Science Coordinator for the
GO-FAIR International Support and
Coordination Office in Leiden, The
Netherlands.
• Previously held appointments at Duke
University Medical Center and Leiden
University Medical Center.
• Erik has worked on data intensive projects
within academia and the private sector.
Georges Heiter
• Founder & CEO of Databiology
• Provides biomedical information
management and orchestration for
the life sciences and healthcare
sectors.
• Enables global distribution of
biomedical data, applications and
infrastructure.
Agenda
Time (BST) | Title | Presenter
15:00 | Introductions & housekeeping | Ian Harrow, Pistoia Alliance
15:05 | Case Study: AZ’s Science Data Foundation | Mathew Woodwark
15:25 | FAIR digital objects for automating processes | Erik Schultes
15:45 | FAIR automation workflows and applications | Georges Heiter
16:05 | Panel | All speakers (Moderator: Ian Harrow)
16:30 | Close |
AstraZeneca’s Science Data Foundation:
Analytics-ready data for machine learning and AI
Mathew Woodwark
Head of Data Infrastructure and Tools
Data Science and AI
Pistoia European Conference, 11th March
AstraZeneca generates and has access to more data than ever before.
R&D pipeline: Target ID, Target Validation, Discovery, Pre-Clinical, Clinical, Commercial, Post-Marketing Surveillance
Data sources: Genetic & Genomic Data, Patient-Centric Data, Sensors & Smart Devices, Interactive Media, Healthcare Information Network, Market Data
A concerted effort is required to shape and govern data,
transforming it into a strategic asset.
From disconnected internal databases and external sources to data that is
FAIR: Findable, Accessible, Interoperable, Reusable
Data domains: ADME, Imaging, RWE, In Vivo Biology, In Silico, Clinical Trial, Phenotypic Screens, HTS, Genomic, Pharmacology, Toxicology, Biomarkers, Efficacy, Literature, Chemistry, Genomics, EHR
The way we analyse data is changing.
Connected data allows us to unleash the power of AI.
Today: INDIVIDUAL DATA TYPES (genomics, sensor/smart, EHR, market, interactive media)
<2 years: CONNECTION OF DATA TYPES
>5 years: ALGORITHMIC INTELLIGENCE
Security/privacy is a key consideration
Data Science uses scientific methods, processes and AI algorithms
to extract insights from these data.
Artificial Intelligence
Any process, task or decision where computerised technology may in some way mimic and/or replace
human intelligence.
Machine Learning
Using algorithms to give a computer system the ability to ‘learn for
itself’ deriving patterns and rules from data it is exposed to, as
opposed to explicit programming.
Manual feature extraction
Deep Learning
A type of machine learning
mimicking the dense set of
interconnections in our brains.
1950 1980 2010
Automated
feature extraction
• Big Data / Cognitive Computing
• Robots / automation
• Sensors / IoT
• NLP / NLG / NLU
• Computer vision / image processing
• Neural networks / deep learning
• Statistical / machine learning
• Chatbots / assistants
AI is a diverse and constantly changing set of disciplines.
AI is any process, task or decision whereby a computerised technology may in some way mimic
and/or replace human intelligence.
Opportunities to extract scientific insights using
Data Science and Artificial Intelligence (AI) exist across R&D.
• Target identification: less attrition
• Trial Optimization: faster and more efficient
• Imaging: less time
• Personalised Medicine: the right medications for the right patients
• Clinical: real-time data, innovative trials
Figures shown on the slide: 10%, 30%, 30%*
Machine Learning • Visual Analytics • Advanced Statistics • Neural Networks • Data Exploration
Signal processing • Natural Language Processing • Math. Modeling • Knowledge Representation
Data Access • Standards • Data Strategy • Training & Awareness • Partnership Management
*Statistics above are for illustrative purposes only
Deeper and more sophisticated scientific insights in patients, medicines & disease.
Opportunities to extract scientific insights using
Data Science and AI exist across R&D
Genomics • Personalised Medicine • Disease Understanding • Drug Design & Synthesis • Imaging • Clinical
Deeper and more sophisticated scientific insights in patients, medicines & disease.
Putting it all together
Data sources and core systems combined create a data backbone upon which we can leverage AI based
capabilities.
This is the place where data science and AI impact lives
>> Our Mission in the Data Science & AI team is to collaborate across R&D to drive innovation
through data science and AI.
Improving our
understanding
of disease and
uncovering
new targets
Transforming
R&D
processes
Speeding the
design and
delivery of
new medicines
for patients
>> Our Vision is that by 2025, data science and AI will have transformed R&D, enabling AZ to
accelerate the delivery of the most life changing medicines to patients.
We use a simple framework to drive innovation in Data Science & AI
Control: Developing standards, governance & policies, ensuring trust, privacy and security in data.
Organise: Processing, formatting, profiling, structuring, capturing meaning in, and relationships between data.
Insight: Creating tools and techniques to extract value, make decisions, report, analyse and act on data.
Learning: Investing in education for all, data science communities, job families, studentships, external comms.
Hub provides strong central capability support, while R&D
functions are spokes providing insights and more.
Hub: Science Data Foundation (Control, Organise, Learning)
• Data management, standards & policies
• Tools & Platforms
• Education & Awareness
Spokes: R&D functions (Insight)
The challenge: Access to high quality data is our life blood, yet today
R&D teams cannot rapidly access and exploit it for re-use
DATA WE OWN TODAY | EMERGING DATA SOURCES, OWNED BY OTHERS
• We don’t know what we have or where it is
• We only use it once
• We can’t compare or combine it
• We don’t know what’s valid
Data types: AZ clinical trial data (23,000 studies) & imported clinical data; Biomarker data; Anonymized external data; CGR Genomic Data; Medical image data; Real-World Evidence Data; Screening and Assay data
Open by default – compliant by design – insights by your deadline
BioPharmaceuticals R&D For Internal Use Only
Science Data Foundation: Democratising data with re-use in mind
What it is:
✓ Collaborative programme between Science IT, DS&AI and R&D
✓ Building enduring capabilities for storing and connecting data sources in a compliant way
✓ A change management programme encouraging data capture and tagging for re-use
✓ Analytics-ready data for ML and AI, the tools, processes and compute environments to drive scientific insight
What it isn’t:
✕ One-time effort
✕ Clean-up effort across all R&D data
Science Data Foundation: A common
way to manage R&D data
[Architecture diagram. Upstream processing: sources feed workflows and indexing into the Science Data Foundation. Master Data provides the common language. Within the Science Data Foundation each data type is held with its metadata: Analytics Data, SAR Data, Reaction Data, Imaging Data, Patient Data, Omics Data, Real World Data, Literature, AZ Documents and Comp Intelligence, spanning Drug Design Data and Biomedical Research Data. Downstream analysis: Data Catalogue, Data selection for AI, AI Orchestration, AI Algorithms, Biological Insights Knowledge Graph, Metrics & Rules (Marts), Reports & Dashboards, ‘Fact’ Discovery (NLP).]
SDF Programme Outline
Vision: All scientific decision-making in AstraZeneca R&D is supported by or improved through the application of data science.
Goal: A scalable and enduring scientific data supply-chain is founded, comprising both technology and services, through which data is made ‘analytics-ready’ and accessible to users through a seamless ‘intent to insight’ workflow.
Strategy: R&D data operations and IT platforms will be co-created between Data Science and AI, R&D business units and Science IT, to be operated as enduring capabilities with a focus on making data ‘FAIR’.
Objectives:
• Build and operate platforms for hosting at least four key analytical data types, that make data ‘FAIR’.
• Data interconnections support cross-domain exploration and analytics.
• Tools and services to support data science workflows are created.
• Data-use is compliant-by-default due to data governance wrappers.
SDF Strategic Drivers - SDF is a key enabler of AZ’s Growth Through Innovation Strategy
01 Accelerate efforts in AI, data and digital: SDF’s biggest tangible value contribution will be to accelerate innovative science through direct enablement of data science workflows and programmes designed to introduce data-driven decision-making.
02 Advance our culture: Data lies at the heart of scientific workflows. By democratizing data through SDF, we will change our culture to one that is more collaborative and truth-seeking, where decisions are data-driven and where we increasingly perform as an enterprise team.
03 Build and adapt capabilities for the future: Through creation of an enduring data supply-chain, SDF will increase AZ’s agility to: take advantage of new data analysis methodologies and technologies; incorporate and drive value from new data sources; and actively govern and manage data in response to changing ethical and legal requirements.
Science Data Foundation (SDF)
Goals
A foundational programme to enable the Growth
Through Innovation Strategy.
Generate analytics-ready data
Create an enduring supply-chain of data of various types and across the discovery and development pipeline that will drive scalable and efficient data science operations.
Seamless intent to insight
Create an efficient and seamless experience throughout the chain of activities scientists undertake to do data science: from planning projects and obtaining the data they need, through performing analyses using powerful tools, to finally applying new insights systematically and at scale in R&D pipelines.
Ensure compliance by default
Introduce governing principles, supported by technology, to minimise risk of data misuse by ensuring compliance with internal and external policies. This shall allow scientists to focus on innovative science, guided through compliant ‘paved-paths’ to R&D data.
Relationships Between SDF Goals
Data sources: Operational systems, other data platforms, instruments and external sources
Analytics Ready Data: The analytics-ready data goal will take data from sources, standardise, integrate and enrich the information. This is then supplied into the intent to insight data-usage process.
Intent to insight: The intent to insight process will create a seamless data usage process that provides a compliant by default path to data use and analytics.
Compliance by default: The compliance by default goal will act to help define or update policies, assert that the policies adhere to external regulations and ensure that the intent to insight process applies
SDF Programme Structure
SDF Leadership
SDF Change & Comms
Sources Workflows
SDF-Core Data Platform
SDF- Data Policy & Governance
SDF-Data Find & Integrate
SDF-Data Science
Storage Ingestion Curation Exploration Access Analysis
SDF Capability Enabling Workstream
Cross-data-type SDF Workstream
Cross data-type data-
management, -quality
and -usage policies
defined. Scientific data
management platforms
setup (e.g., reference
data management, data
catalogue). Governing
procedures that apply
policies to SDF
processes.
Provide cross-cutting
capabilities that
enable all data-type
workstreams to
develop against a
consistent, supported
data foundation.
SDF Analytical Data-Type Workstream
Analytical data type workstreams (ADD, Patient, Omics, Imaging) will prepare and process data and meta-data for ingestion into the core data platform. In doing so: the data will
become accessible according to standard policies and access mechanism alongside other data types; Standard patterns of exploration and data science will be enabled,
although data-type workstreams are required to develop highly data-specific exploration and modes of analysis (e.g., genome browser and ‘omic variant analysis for SDF-Omics).
SDF Workstreams
SDF Data Workstreams:
ADD; Patient; Omics; Imaging
Goal: Generate Analytics Ready Data
Ingest and
cataloguing
Standardise
and improve
quality
Curation and
enrichment
Data Hosting
✓ Reduce the time taken and effort by data scientists to assemble data into a single place.
✓ Reduce costs associated with lost innovation opportunities due to scientific data being unfindable, unusable or inaccessible to analytics toolsets.
Data availability
triggers
automated flow
into hosting
environment
Automated
cataloguing of data on
ingest is an enabler of
findability
Data ingestion can be
templated to ensure
new data sources
have low barrier to
also becoming
hosted.
Data can be
‘cleaned up’ by
applying
standardisation of
key terms and
identifiers. Data
quality can be
measured to help
ultimate consumers
plan their analyses.
Enrichment of
information and
metadata can be
both automated and
expert-driven to
create greater data
reusability and thus
value.
All data and
metadata is hosted
through an
accessible
environment so that
information
discovery and
analytics tools can
gain systematic (yet
secure) access.
Full track and trace
and monitoring of a
maximally
automated process
supports content
reporting and
auditability.
Target SDF Capabilities
An enduring supply-chain of data of various types and across the
discovery and development pipeline that will drive scalable, and
efficient data science operations. Key requirements are:
• Data quality and completeness
• Machine readable metadata
• A hosting environment that can support access by other
systems
Description Benefit Strategy
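The "standardise and improve quality" step above can be illustrated with a minimal Python sketch. It assumes a hypothetical synonym table (TERM_MAP) and a hypothetical list of required fields; neither comes from the slides, and a real pipeline would draw them from reference and master data.

```python
# Hypothetical synonym table: maps free-text species terms to one preferred label.
TERM_MAP = {
    "human": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
}

# Hypothetical set of fields a record needs before it counts as complete.
REQUIRED_FIELDS = ("study_id", "species", "assay")


def standardise(record):
    """Return a copy of the record with the species term mapped to its preferred label."""
    cleaned = dict(record)
    species = (record.get("species") or "").strip()
    if species:
        cleaned["species"] = TERM_MAP.get(species.lower(), species)
    return cleaned


def completeness(records):
    """Fraction of required fields that are populated across all records."""
    if not records:
        return 0.0
    filled = sum(1 for r in records for f in REQUIRED_FIELDS if r.get(f))
    return filled / (len(records) * len(REQUIRED_FIELDS))


raw = [
    {"study_id": "S1", "species": "human", "assay": "HTS"},
    {"study_id": "S2", "species": " H. sapiens ", "assay": ""},
]
clean = [standardise(r) for r in raw]
score = completeness(clean)
print(clean[1]["species"], round(score, 2))
```

Publishing a score like this alongside the catalogue entry is one way consumers could plan their analyses around known gaps.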
Goal: Seamless Intent to insight
Ideation using
Information
Discovery
Register intent
and make
data request
Data is
provisioned
Analysis and
insights
Application of
insight
✓ Reduce the time and effort to generate and administer intent to insight activities, allowing greater scale and lower cost of data reuse.
✓ Reduce wasted effort associated with scientifically flawed or non-compliant data reuse requests.
✓ Increase analytics capabilities to drive innovation
✓ Improve experience and job satisfaction
Single point of
entry to
simplify and
lower barriers
to data reuse.
Powerful and
intuitive
information
discovery tools
and connection to
other experts to
enable scientific
ideation
Intent, data and
analysis
requirements
captured and issued
electronically to
ensure governance
with minimal
administration effort
Data provided to an
analysis team in the
desired format and
analysis
environment. Data
compliance and
security are by
default. Bespoke
data products also
supported
Powerful analysis
environments to
support data science
& AI workflows.
Insights are
captured and
traceable to
requests
QA triggered for
investment decisions
and external
publications. Insights
with potential as BAU
decision-support
processes will trigger
further creation of
productionised data-
analytics pipelines
Full track and trace
and monitoring of a
maximally
automated process
supports audit and
process
improvement
Target SDF Capabilities
The chain of activities scientists undertake to
plan data science projects, obtain the data
they need, perform analyses and finally apply
new insights systematically and at scale into
R&D pipelines.
Description Benefit Strategy
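As a rough illustration of "full track and trace" across the intent to insight chain, the sketch below models a data request that moves through the five stages named above and keeps an audit trail. The class name, stage identifiers and request ID are hypothetical; this is not a description of AstraZeneca's system.

```python
from dataclasses import dataclass, field

# Stage names paraphrase the five steps on the slide.
STAGES = [
    "ideation",
    "intent_registered",
    "data_provisioned",
    "analysis",
    "insight_applied",
]


@dataclass
class DataRequest:
    request_id: str
    intent: str
    stage: str = "ideation"
    audit_trail: list = field(default_factory=list)
    insights: list = field(default_factory=list)

    def advance(self):
        """Move to the next stage and log the transition for auditability."""
        idx = STAGES.index(self.stage)
        if idx == len(STAGES) - 1:
            raise ValueError("request is already at the final stage")
        self.audit_trail.append((self.stage, STAGES[idx + 1]))
        self.stage = STAGES[idx + 1]

    def record_insight(self, text):
        """Capture an insight so that it stays traceable to this request."""
        if self.stage != "analysis":
            raise ValueError("insights are recorded during the analysis stage")
        self.insights.append({"request_id": self.request_id, "insight": text})


request = DataRequest("REQ-001", "Compare biomarker response across studies")
for _ in range(3):
    request.advance()
request.record_insight("Placeholder finding")
print(request.stage, len(request.audit_trail))
```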
Goal: Ensure compliance by default
Ethical and
legal
frameworks
Manage data
standards
Securing our
data
Training
✓ Reduce likelihood of fines associated with legal or ethical misuse of data.
✓ Reduce the burden on scientists to become compliance experts and allow them more time to focus on science, leading to increased innovation-based revenue generation.
Frameworks that are
built into systems and
processes are fit for
innovation purposes;
balancing potentially
changing restrictions
that prevent misuse
with enablement of
data science.
Information that
supports ethical and
legal data reusability is
machine readable and
can be efficiently
managed by the Data
Office.
Host systems are
secure from cyber
attack and only allow
users to perform
operations such as
data access, copy or
movement without
increasing risk.
Target SDF Capabilities
Governing principles, supported by technology, to
minimise risk of data misuse by ensuring
compliance to internal and external policies. This
shall allow scientists to focus on innovative science,
guided through compliant ‘paved-paths’ to R&D
data.
Description
Provide training on
processes and
systems so that
compliant paths to
request and access
data are known.
Compliance
monitoring
Active compliance
monitoring to
provide early
warning of risks
associated with data
reuse, helping to
target training and
remediation.
Benefit Strategy
Ideation &
discovery
Intent &
Request
Data
Provisioning
Analysis
Application
of insight
As a scientist the Data Office
provides me with a single
point of entry to begin a
process to exploit our data
and the information and
data exploration tools to
drive my scientific creativity
and ideation.
I am able to find and
request data online. The
Data Office is on hand to
advise me on issues of
compliance and they also
help to put me in touch with
other experts. From the point
of creating my request, I can
follow the process easily.
Whether you are an
expert in AI, visual
informatics, or have more
scientific than IT
expertise, Data Office will
help you get your data to
the right analysis
environment, including
cutting-edge cloud
environments.
Data office helps to ensure the
right quality processes are
triggered for investment
decisions and external use,
meaning that we customers
can focus more on the science.
When we’ve generated a
promising new exploratory
model that could be
productionised to drive real
value, Data Office will help us
‘productionise’ the data flow
alongside our IT colleagues.
Data office can get the data
to you in a format you need
and to a place where you can
perform your analysis in a
compliant and secure way.
This ranges from systematic
data flows to bespoke ‘data
products’.
‘Intent to Insight’ – the process experienced by our
customers.
Data office
provides a single
point of entry for
gaining access to
data
Expert support and
maximal automation
through the process
ensures efficiency
yet data
compliance by
default
End of 2020 target
Deliverables,
benefits and
next steps
New Target Biology
AI-driven Lead Optimisation
Driving re-use of clinical
data: 1000 studies in 2019,
1 million patients in 2020
Erik Schultes, PhD
International Science Coordinator

GO FAIR International Support and Coordination Office 

Leiden Center for Data Science

erik.schultes@go-fair.org
https://www.go-fair.org
http://orcid.org/0000-0001-8888-635X
FAIR Digital Objects
for
Automating Processes
14 May, 2020
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship.
Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).
Automating F, A, I and R
https://www.go-fair.org/today/fair-digital-framework/
Paris, October 28-29, 2019
RDA / GEDE: FAIR Digital Objects
https://fairdo.org
FAIR Digital Objects
Based on Bonino 2019
minimal open standard
linking the FDO components
‘everything else’
TBA
1) GUPRI resolution service
2) Recursive FDO construction
Machine-
actionable
atom-to-
atom
configuration
FAIR Molecule
a FAIR Digital Object for molecular structure
A minimal standard* for a machine-actionable** representation of
molecular structure*** that can be the basis of organizing other
heterogeneous (meta)data****.
* Easy to follow, encourages voluntary adoption
** FAIR
*** Foundational concept in the chemistry domain
**** Knowlet-like clusters of assertions about molecular structure
FAIR Molecule
a FAIR Digital Object for molecular structure
Why FAIR Molecules?
• Chemical view of the world is ubiquitous (example: biomedicine)
• Chemical data is vast and complex
• Rate of chemical data production is vast and growing
• FAIR solutions are welcomed
FAIR Molecule Hackathon
21 & 22 January, 2020
Hamburg
https://osf.io/ft6wn/
Tuesday January 21
• 13:00 Lunch
• 14:00 Welcome / Overview (Erik)
• 14:30 Participants Introductions
- Rajaram / Kees / Luiz (FDO)
- Yuliia / Alessa / Nicola (Molecular Structure)
- Robert / Barbara / Stuart (Conceptual Models)
- Hao / Folkert (DataBiology)
- Myles / Erik (use cases)
- John / Robert (Launch Pads)
• 16:00 Break
• 16:20 Task Organization / Discussion
• 18:00 Pizza dinner
• 19:00 Continue as desired
• 22:00 ZBW doors close
Hackathon Agenda
Goal:
Show FAIR interoperation between data & code
Hackathon Agenda
Resolves to
GUPRI
ePIC
FAIR Digital Object Record
fdo:digitalObjectOfType fdo:MGFile ;
fdo:locationOfDO <https://hackathon.fair-dtls.surf-hosted.nl/EL/> ;
datacite:hasIdentifier :identifier ;
dct:conformsTo <https://hackathon.fair-dtls.surf-hosted.nl/shacl-record.ttl> .
fdof:hasResourceLocation
Resource
fdo:digitalObjectOfType
Type
MG File
fdof:hasMetadata fdof:isMetadataOf
Extensible Metadata
# metadata section
#<http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID702> ; # Ethanol
#<http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID5280450> ; # Linoleic acid
#<http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID5282184> . # Ethyl linoleate
:elMetadata :respresents :molecule .
:molecule :molecularWeight "308.47"^^:gramsPerMol ;
    skos:prefLabel "Ethyl linoleate" ;
    skos:notation "C20H36O2" ;
    :cas "544-35-4" ;
    <http://semanticscience.org/resource/SIO_000212> <http://dx.doi.org/10.1002/anie.201801332> ; # is referred to by
    :availableAt <https://www.sigmaaldrich.com/catalog/search?term=ethyl+linoleate&interface=All&N=0&mode=match%20partialmax&lang=en&region=US&focus=product> .
# provenance
:elMetadata dct:contributor orcid:0000-0002-8042-4131 .
orcid:0000-0002-8042-4131 a foaf:Person ;
    foaf:name "Myles Axton" ;
    pro:holdsRoleInTime [
        a pro:RoleInTime ;
        pro:withRole scoro:investigator-role ;
    ] .
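The chain shown above (a GUPRI resolves to an FDO record, which points to a type, a resource location and extensible metadata) can be sketched in plain Python. This is an illustrative stand-in only: the in-memory REGISTRY replaces a real resolution service such as ePIC, and the identifier "hdl:example/EL" is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FDORecord:
    gupri: str        # globally unique, persistent, resolvable identifier
    object_type: str  # value of fdo:digitalObjectOfType
    location: str     # value of fdo:locationOfDO
    metadata: dict    # extensible metadata assertions


# Hypothetical registry standing in for a GUPRI resolution service.
REGISTRY = {
    "hdl:example/EL": FDORecord(
        gupri="hdl:example/EL",
        object_type="fdo:MGFile",
        location="https://hackathon.fair-dtls.surf-hosted.nl/EL/",
        metadata={"skos:notation": "C20H36O2", ":cas": "544-35-4"},
    )
}


def resolve(gupri):
    """Return the FDO record for a GUPRI, or raise LookupError if unknown."""
    try:
        return REGISTRY[gupri]
    except KeyError:
        raise LookupError(f"no FDO record registered for {gupri}") from None


record = resolve("hdl:example/EL")
print(record.object_type, record.location)
```

A client that only understands the record layout can decide from object_type whether it is able to process the resource before fetching it from location.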
FAIR Molecule 1
• GUPRI
• FDO Record
• Type - Molecular Graph
• Extensible Metadata
• Resource - molecular structure
FAIR Molecule 2
• GUPRI
• FDO Record
• Type - .mol
• Extensible Metadata
• Resource - molecular structure
FDO for scripts
• GUPRI
• FDO Record
• Type - File conversion script
• Extensible Metadata
• Resource - Docker image
FAIR Molecule Hackathon
FDO Orchestration
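One way to picture the hackathon goal of FAIR interoperation between data and code is type-driven dispatch: a data FDO declares its type, a script FDO declares which type it consumes and produces, and an orchestrator matches the two. The sketch below is a simplified assumption of how that could work. The class names, identifiers and the placeholder conversion function are hypothetical, and a real script FDO would point to a Docker image, not a Python callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DataFDO:
    gupri: str
    object_type: str
    content: str


@dataclass(frozen=True)
class ScriptFDO:
    gupri: str
    input_type: str
    output_type: str
    run: Callable[[str], str]  # stands in for executing the Docker image


def find_converter(scripts, source, target_type) -> Optional[ScriptFDO]:
    """Pick the first script whose declared types match the request."""
    for script in scripts:
        if script.input_type == source.object_type and script.output_type == target_type:
            return script
    return None


def convert(scripts, source, target_type, new_gupri):
    """Run the matching script and wrap its output as a new data FDO."""
    script = find_converter(scripts, source, target_type)
    if script is None:
        raise LookupError(f"no script converts {source.object_type} to {target_type}")
    return DataFDO(new_gupri, target_type, script.run(source.content))


# Placeholder conversion: NOT a real molecular file converter.
scripts = [ScriptFDO("hdl:example/mg2mol", "MolecularGraph", ".mol", lambda g: "MOL:" + g)]
molecule_1 = DataFDO("hdl:example/molecule-1", "MolecularGraph", "C-C-O")
molecule_2 = convert(scripts, molecule_1, ".mol", "hdl:example/molecule-2")
print(molecule_2.object_type, molecule_2.content)
```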
FAIR Molecule
Established Knowledge - chemical informatics
Real World Observations - lab automation
Virtual World Observations - computer simulations
chemify.org
FAIR Molecules as Digital Twins
https://www.manufacturingleadershipcouncil.com/2019/12/02/digital-twins/
chemify.org
FAIR Molecule
Drug candidates
for COVID-19
FDO Hackathon
https://docs.google.com/document/d/1rhUeMmdIf7khn5XAgLW0oBpqp81kjtmmxEC6queRl5A/edit?usp=sharing
https://www.go-fair.org/today/FAIR-funder/
Convergence
Convergence matrix (rows: resources, grouped under F, A, I, R; columns: communities)
Resource 1   0 1 0 0 0 0 0
Resource 2   1 1 1 0 1 1 1
Resource 3   1 1 0 0 0 0 0
Resource 4   0 0 1 1 1 1 1
Resource 5   1 0 0 0 0 0 0
Resource 6   0 1 1 0 1 1 1
Resource 7   1 0 0 0 0 0 0
Resource 8   1 0 1 0 1 1 1
FAIR Implementation Profiles
Convergence Matrix http://www.data-intelligence-journal.org/p/47/
Reusing FIPs https://osf.io/8sv5f/
Convergence
• FIPs are reusable = drives convergence
• FIPs guarantee interoperation
• FIPs inform data stewardship plans
FIPs are the DNA of the DMP
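To make the idea of convergence concrete, the sketch below treats each community's FAIR Implementation Profile as the set of resources it has chosen and compares profiles pairwise. It reuses the 0/1 values from the convergence slide, reading rows as resources and columns as communities; that reading, and the use of Jaccard similarity as a convergence measure, are assumptions made for illustration.

```python
# Rows: Resource 1..8. Columns: seven communities (1 = community uses the resource).
MATRIX = [
    [0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 1, 1],
]


def community_profile(matrix, column):
    """Set of row indices (resources) chosen by one community."""
    return {row for row, values in enumerate(matrix) if values[column] == 1}


def jaccard(a, b):
    """Shared choices divided by all choices made by either community."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def adoption(matrix):
    """Number of communities using each resource."""
    return [sum(row) for row in matrix]


profiles = [community_profile(MATRIX, c) for c in range(len(MATRIX[0]))]
print(jaccard(profiles[2], profiles[4]))  # identical profiles
print(adoption(MATRIX))
```

Communities that reuse an existing profile end up with identical columns, which is the sense in which reusable FIPs drive convergence.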
Pharma Industry Challenge:
Develop a common pre-competitive FIP
Thank you
&
Questions
3 communities building FAIR distributed learning platforms
“FAIR Data Trains”
Barbara Magagna, Umweltbundesamt GmbH
Kristina Hettne, CDS University Library
Choice
Challenge
A
F
R
I
FAIR automation and
FAIR applications
May 2020
Georges Heiter
Copyright ©2020. All Rights Reserved. Confidential Databiology Ltd.
AUTOMATION
Humans are manually involved in every step of the research process
Bulk of energy is still spent on finding and preparing data for analysis
Metadata about digital assets and the operations upon them is mostly not being
captured
→ research not easily repeatable or automatable
Data Usage Challenge (making data actionable)
Data
Source
Model 1
Data Source
Model 2
Data Source
Model n
Analysis Data
Model 1
Analysis Data
Model 2
Analysis Data
Model n
Knowledge
Network
Data
Source
Model
Narrow
scope
Specialized Use case specific
and/or
Proprietary
Domain Scope Flexibility
Knowledge
Network
Multi-domain Broad &
standardized
Growing/changing
Analysis
Data
Model
Cross-domain Specialized Use case specific
and/or
Proprietary
Machine Actionable Components as Foundation of the Knowledge Network
▪ Findable with unique ID, and digitally signed
▪ Accessible in an associated permanent registry
▪ Interoperable because they rely on standards
▪ Reusable as self-contained and fully portable
▪ Software integrity and quality
▪ Customizable
What makes an application FAIR?
App metadata
− Name
− Version
− Author
− Description
− Inputs
− Outputs
− Parameters
− License
− Original source
− Reference data dependencies
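The metadata fields listed above map naturally onto a small record type. The following sketch is an assumption about how such metadata could be represented and checked for gaps. It is not the CIAO format, and the example values for inputs, outputs and description are illustrative.

```python
from dataclasses import dataclass, field, fields


@dataclass
class AppMetadata:
    name: str
    version: str
    author: str
    description: str
    license: str
    original_source: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    parameters: dict = field(default_factory=dict)
    reference_data: list = field(default_factory=list)

    def missing(self):
        """Names of text fields that were left empty."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str) and not getattr(self, f.name).strip()
        ]


# Illustrative values; author, license and source are deliberately left blank.
blast = AppMetadata(
    name="blast",
    version="2.9.1",
    author="",
    description="Sequence similarity search",
    license="",
    original_source="",
    inputs=["fasta"],
    outputs=["alignment"],
)
print(blast.missing())
```

A registry could refuse to publish an app until missing() returns an empty list, so that every distributed app carries the full descriptive set.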
CIAO App – software packaged with metadata
Apps are stored and distributed in a repository with a unique ID:
https://hub.databiology.net/app-dbio-blast/tags/2.9.1
docker pull hub.databiology.net/app-dbio/blast:2.9.1
Metadata is integrated in the container
CIAO app
Code
Aux
Metadata
Sets
CIAO app instance
Links the app to an infrastructure and organizational context
Defines
The workunit will keep record of:
CIAO App run
App instance execution
record
− App instance used
− App status
− Inputs and outputs
− Execution versions
− Execution logs
− Infrastructure used
− Keeps data
provenance
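The relationship between an app, an app instance and an app run can be sketched as three nested records, with the run (workunit) holding enough information to answer "what produced this output?". Class and field names below are hypothetical and greatly simplified relative to whatever the actual platform stores.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class App:
    name: str
    version: str


@dataclass(frozen=True)
class AppInstance:
    app: App
    storage: str  # where inputs and outputs live
    compute: str  # where the app executes


@dataclass
class AppRun:
    instance: AppInstance
    inputs: list
    status: str = "pending"
    outputs: list = field(default_factory=list)
    logs: list = field(default_factory=list)

    def log(self, message):
        """Append a timestamped entry to the execution log."""
        self.logs.append((datetime.now(timezone.utc).isoformat(), message))

    def complete(self, outputs):
        """Record outputs and mark the run as finished."""
        self.outputs = list(outputs)
        self.status = "succeeded"
        self.log("run completed")

    def provenance(self):
        """For each output: the app version, infrastructure and inputs behind it."""
        app = self.instance.app
        return {
            out: {
                "app": f"{app.name}:{app.version}",
                "compute": self.instance.compute,
                "inputs": list(self.inputs),
            }
            for out in self.outputs
        }


instance = AppInstance(App("blast", "2.9.1"), storage="storage-a", compute="cluster-a")
run = AppRun(instance, inputs=["query.fasta"])
run.complete(["hits.tsv"])
print(run.provenance())
```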
CIAO apps evolution – progressive layering of metadata to make apps FAIR
CIAO app: Code, Aux Data, Metadata
CIAO app instance: Storages, Compute, Security, Policies
CIAO app run (Workunit): Inputs, Outputs, Parameters, Logs, Versions
Machine actionable policies and secrets
Policy based
Consent
Management
▪ Policies make use of metadata
− Example: Define consent tags on studies, datasets and entities
− Key operations on data, applications and infrastructure subject to
policy
− Granularity vs scalability
▪ Stand-alone Policy service
− System landscape enforces policies managed in policy service
− OPA (https://www.openpolicyagent.org/)
▪ Stand-alone secret management
− Facilitation of security workflows
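A minimal sketch of a policy decision that makes use of metadata, assuming consent tags are attached to datasets and each operation requires a set of tags. The tag names and the POLICIES table are invented for illustration. In a stand-alone policy service such as OPA the rules would live in the service itself; this Python function only mimics the shape of the decision and does not use OPA's API.

```python
# Hypothetical policy table: operation -> consent tags the dataset must carry.
POLICIES = {
    "analyse": {"research-use"},
    "copy": {"research-use", "export-permitted"},
}


def is_allowed(operation, dataset_tags):
    """Deny by default; allow only if every required tag is present on the dataset."""
    required = POLICIES.get(operation)
    if required is None:
        return False
    return required <= set(dataset_tags)


dataset_tags = {"research-use"}
decisions = {op: is_allowed(op, dataset_tags) for op in ("analyse", "copy", "delete")}
print(decisions)
```

Tagging at study level keeps the table small, while tagging individual entities gives finer control at the cost of more decisions to evaluate, which is the granularity versus scalability trade-off noted above.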
Composability
Databiology Approach – Intelligent Automation powered by a Knowledge Network that converges Data, Applications, Infrastructure and Organizations
[Diagram labels: Knowledge Network; Intelligent Automation; DATA; Source 1, Source 2, … Source n; APPLICATIONS; App 1, App 2, … App n; INFRASTRUCTURE; Compute Site; PEOPLE, ORGS & POLICIES; Application; Orchestration Engine; Knowledge Engine; Data]
Knowledge Engine
Data Modeling
▪ Entity Definition / MDS
▪ Terminology Service (Ontologies)
▪ Policy Service
Data Discovery
▪ Search
▪ Ontology Mapping
▪ Collection Management
▪ Federated Search
Data Ingestion
▪ Aggregation
− Multi-Channel (Batch / Stream)
− Enrichment
− Validation
− Origination (Lineage / Provenance)
− Persistence
▪ Federation
− Federated Data Sources
− Origination (Lineage / Provenance)
− Caching
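The ingestion steps listed above (validation, origination, persistence) can be sketched as a small function that accepts or rejects records and stamps each accepted record with its source and a content hash. The required fields and the record layout are assumptions made for this example only.

```python
import hashlib
import json
from typing import Dict, List, Tuple

REQUIRED_FIELDS = ("sample_id", "assay")  # hypothetical minimal schema


def ingest(records: List[Dict], source: str) -> Tuple[List[Dict], List[Dict]]:
    """Validate records and attach provenance to the ones that pass."""
    accepted, rejected = [], []
    for record in records:
        if all(record.get(name) for name in REQUIRED_FIELDS):
            # A stable serialisation gives a reproducible content hash.
            payload = json.dumps(record, sort_keys=True).encode("utf-8")
            accepted.append({
                "data": record,
                "provenance": {
                    "source": source,
                    "sha256": hashlib.sha256(payload).hexdigest(),
                },
            })
        else:
            rejected.append(record)
    return accepted, rejected


ok, bad = ingest(
    [{"sample_id": "S1", "assay": "rna-seq"}, {"sample_id": "S2"}],
    source="example-lims",
)
print(len(ok), len(bad))  # 1 1
```

Recording the source and hash at ingest time is what later lets a federated or cached copy be traced back to where it came from.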
Orchestration Engine
▪ Secret Management
− Secure Credential Store
− Security workflows to provide secrets to orchestration processes
▪ Workunit Management
− Inspection (Real-time monitoring)
− Monitoring & Logging
− State control (Real-time)
▪ Compute Capacity Management
− Provisioning / Deprovisioning Dynamic Capacity (VMs)
− Cloud Providers
− On-premise technologies
▪ Data Orchestration
− Data Transport (unstructured)
o Transfer Protocols
− Data Projections (entity data)
o Covers Ingress and Egress entity data
▪ Application Orchestration
− Application Registry
− Application Transport
− Dynamic Proxying of Interactive Apps to user browser
Example: Contextually aware research assistant delivers intelligent automation
▪ Suggests analysis apps based on contextual data, including the user’s data selection and their previous analysis runs
▪ Automatically routes to and executes analysis on the most suitable infrastructure
▪ Automatically extracts insights and feeds them back into the knowledge graph
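One way to picture the app suggestion step is a ranking that first filters apps by the data types in the user's selection and then orders them by how often the user has run each one before. The registry contents and the ranking rule below are hypothetical and far simpler than a real contextual assistant would need.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical app registry: app name -> input data types it accepts.
APP_INPUTS: Dict[str, Set[str]] = {
    "blast": {"fasta"},
    "variant-caller": {"bam"},
    "image-segmenter": {"tiff"},
    "aligner": {"fasta", "fastq"},
}


def suggest_apps(selected_types: Set[str], previous_runs: List[str]) -> List[str]:
    """Rank apps that accept the selected data, most frequently used first."""
    history = Counter(previous_runs)
    candidates = [
        app for app, inputs in APP_INPUTS.items() if inputs & selected_types
    ]
    # Sort by descending usage, then by name for a stable order.
    return sorted(candidates, key=lambda app: (-history[app], app))


print(suggest_apps({"fasta"}, ["aligner", "aligner", "blast"]))
# ['aligner', 'blast']
```

Both inputs to the ranking are metadata the platform already records: input types come from the app description, and usage history comes from earlier run records.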
Automation will free researchers to focus on higher level tasks
Let machines take over manual, labor-intensive functions so that researchers can focus on ideation and creativity, for a LOWER COST PER INSIGHT
Intelligent Automation Value
Increase Researcher effectiveness and efficiency
• Discover more across a federated knowledge network and collaborate securely
• Automation and AI allow researchers to focus on the science instead of the IT
• Always use best in class analytics tools to get the most insights out of data
Mitigate risk
• Automatic audit trail, provenance and reproducibility
• Future-proof, with no technology stack lock-in (a side effect of composability)
• Lasting data integration and interoperability (knowledge network)
Lower the cost per insight
• Achieve higher levels of automation -> contextually aware assistance
• Eliminate duplication of effort
• Speed to value measured in weeks, not years
Do You Have Any Questions?
Expert Panel
Prepared Questions
1. What are the different flavours of FAIR implementation and application for the Life Science industry?
2. What are the low-hanging fruit and likely challenges of FAIR implementation and application by the Life Science industry?
3. What would a common specification for FAIR digital objects look like? Why is this important for the Life Science industry?
Questions from the audience
Get Involved!
Join the FAIR Implementation project
Ian Harrow
Ian.harrow@pistoiaalliance.org
Membership:
membership@pistoiaalliance.org
General Enquiries:
Zahid Tharia – zahid.tharia@pistoiaalliance.org
www.pistoiaalliance.org
Next Webinar
Lab of the Future
Thurs, May 21st, 2020, 14:30 – 16:00 BST
www.pistoiaalliance.org/webinars-2020/
info@pistoiaalliance.org @pistoiaalliance www.pistoiaalliance.org

Fair by design

  • 1.
    Pistoia Alliance Webinar FAIRby Design 14th May 2020 15.00 to 16.30 BST
  • 2.
    This webinar isbeing recorded
  • 3.
    Audience Q&A Please usethe questions box
  • 4.
    Introduction Ian Harrow, ProjectManager Pistoia Alliance
  • 5.
    ©PistoiaAlliance Themes and Objectives 5 •To position FAIR as a key enabler to automate and accelerate R&D process workflows • FAIR Implementation by design within the context of a use case • Grounded in precise outcomes – e.g. faster and bigger science / more reuse of data to enhance value / increased ability to share data for collaboration and partnership • To make data actionable especially through FAIR interoperability
  • 6.
    ©PistoiaAlliance Bios 6 Mathew Woodwark • Headof Data Infrastructure, Tools, Data Science and AI at AstraZeneca • Experienced Informatics and Information Management professional, with an established track record of delivery • Combines biological understanding, organisational psychology and technical skills to managing a wide portfolio of complex Informatics projects Erik Schultes • International Science Coordinator for the GO-FAIR International Support and Coordination Office in Leiden, The Netherlands. • Previously held appointments at Duke University Medical Center and Leiden University Medical Center. • Erik has worked on data intensive projects within academia and the private sector. Georges Heiter • Founder & CEO for the Databiology • Provides biomedical information management and orchestration for the life sciences and healthcare sectors. • Enables global distribution of biomedical data, applications and infrastructure.
  • 7.
    ©PistoiaAlliance Agenda 7 Time (BST) Title Presenter 15:00 Introductions& housekeeping Ian Harrow, Pistoia Alliance 15:05 Case Study: AZ’s Science Data Foundation Mathew Woodwark 15:25 FAIR digital objects for automating processes Erik Schultes 15:45 FAIR automation workflows and applications Georges Heiter 16:05 Panel All speakers Moderator: Ian Harrow 16:30 Close
  • 8.
    AstraZeneca’s Science DataFoundation: Analytics-ready data for machine learning and AI Mathew Woodwark Head of Data Infrastructure and Tools Data Science and AI Pistoia European Conference, 11th March
  • 9.
    AstraZeneca generates andhas access to more data than ever before. Target ID Target Validation Discovery Pre- Clinical Clinical Commerci al Post Marketing Surveillanc e Genetic & Genomic Data Patient-Centric Data Sensors & Smart Devices Interactive Media Healthcare Information network Market Data
  • 10.
    3 A concerted effortis required to shape and govern data, transforming it into a strategic asset. From disconnected internal databases and external sources to data that is FAIR: Findable, Accessible, Interoperable, Reusable ADME Imaging RWE In Vivo Biology In Silico Clinical Trial Phenotypic Screens HTS Genomic Pharmacology Toxicology Biomarkers Efficacy Literature Chemistry
  • 11.
    Genomics EHR The way weanalyse data is changing. Connected data allows us to unleash the power of AI. Today Security/privacy is a key consideration <2 years >5 years INDIVIDUAL DATA TYPES CONNECTION OF DATA TYPES ALGORITHMIC INTELLIGENCE Genomics Sensor/ smart EHR Market Interactive media Sensor/ smart Market Interactive media
  • 12.
    5 Data Science usesscientific methods, processes and AI algorithms to extract insights from these data. Artificial Intelligence Any process, task or decision where computerised technology may in some way mimic and/or replace human intelligence. Machine Learning Using algorithms to give a computer system the ability to ‘learn for itself’ deriving patterns and rules from data it is exposed to, as opposed to explicit programming. Manual feature extraction Deep Learning A type of machine learning mimicking the dense set of interconnections in our brains. 1950 1980 2010 Automated feature extraction
  • 13.
    6 Big Data / Cognitive Computing Robots/ automation Sensors / IoT NLP / NLG / NLU Computer vision / image processing Neural networks / deep learning Statistical / machine learning Chatbots / assistants AI is a diverse and constantly changing set of disciplines. AI is any process, task or decision whereby a computerised technology may in some way mimic and/or replace human intelligence.
  • 14.
    7 Opportunities to extractscientific insights using Data Science and Artificial Intelligence (AI) exist across R&D. Target identification less attrition Trial Optimization faster and more efficient Imaging less time Personalised Medicine the right medications for the right patients Clinical real-time data innovative trials 10% 30% Machine Learning Ÿ Visual Analytics Ÿ Advanced Statistics Ÿ Neural Networks Ÿ Data Exploration Signal processing Ÿ Natural Language Processing Ÿ Math. Modeling Ÿ Knowledge Representation Data Access Ÿ Standards Ÿ Data Strategy ŸTraining & Awareness Ÿ Partnership Management *Statistics above are for illustrative purposes only Deeper and more sophisticated scientific insights in patients, medicines & disease. 30%
  • 15.
    8 Opportunities to extractscientific insights using Data Science and AI exist across R&D Genomics Personalised Medicine Disease Understanding Drug Design & Synthesis Imaging Deeper and more sophisticated scientific insights in patients, medicines & disease. Clinical 1 2
  • 16.
    14 May 2020Name 9 Puttingit all together Data sources and core systems combined create a data backbone upon which we can leverage AI based capabilities.
  • 17.
    Name 10 This is theplace where data science and AI impact lives >> Our Mission in the Data Science & AI team is to collaborate across R&D to drive innovation through data science and AI. Improving our understanding of disease and uncovering new targets Transforming R&D processes Speeding the design and delivery of new medicines for patients >> Our Vision is that by 2025, data science and AI will have transformed R&D, enabling AZ to accelerate the delivery of the most life changing medicines to patients.
  • 18.
    11 CONFIDENTIAL Developing standards,governance & policies, ensuring trust, privacy and security in data. Processing, formatting, profiling, structuring, capturing meaning in, and relationships between data. Creating tools and techniques to extract value, make decisions, report, analyse and act on data. Investing in education for all, data science communities, job families, studentships, external comms. >> >> >> We use a simple framework to drive innovation in Data Science & AI Control Organise Insight Learning>>
  • 19.
    12 Hub provides strongcentral capability support, while R&D functions are spokes providing insights and more. • Data management, standards & policies • Tools & Platforms • Education & Awareness Control Organise Insight Insight Insight Insight Insight Insight Insight Insight Insight Learning
  • 20.
  • 21.
    The challenge: Accessto high quality data is our life blood, yet today R&D teams cannot rapidly access and exploit it for re-use DATA WE OWN TODAY EMERGING DATA SOURCES, OWNED BY OTHERS We don’t know what we have or where it is We only use it once We can’t compare or combine it We don’t know what’s valid AZ clinical trial data (23,000 studies) & imported clinical data Biomarker data Anonymized external data CGR Genomic Data Medical image data Real-World Evidence DataScreening and Assay data
  • 22.
    Open by default– compliant by design – insights by your deadline BioPharmaceuticals R&D For Internal Use Only15
  • 23.
    What it is: üCollaborative programme between Science IT, DS&AI and R&D ü Building enduring capabilities for storing and connecting data sources in a compliant way üA change management programme encouraging data capture and tagging for re-use ü Analytics-ready data for ML and AI, the tools, processes and compute environments to drive scientific insight Science Data Foundation: Democratising data with re-use in mind What it isn’t: ✕ One-time effort ✕ Clean-up effort across all R&D data BioPharmaceuticals R&D For Internal Use Only16
  • 24.
    Science Data Foundation:A common way to manage R&D data Master Data: Common Language Workflow(s) Sources Workflow(s) Sources Workflow(s) Sources Workflow(s) Sources Workflow(s) Sources Workflow(s) Sources Workflow(s) Sources Indexing Sources Indexing Sources Indexing Sources Biological Insights Knowledge Graph Data Catalogue Data selection for AI AI Orchestration AI Algorithms Metrics & Rules (Marts) Reports & Dashboards ‘Fact’ Discovery (NLP) Analytics Data SAR Data Reaction Data Imaging Data Metadata Metadata Metadata Metadata Science Data Foundation Biomedical Research DataDrug Design Data Patient Data Metadata Omics Data Metadata Real World Data Metadata Literature Metadata AZ Documents Metadata Comp Intelligence Metadata Upstream Processing Down Stream Analysis 17
  • 25.
    SDF Programme Outline Vision Allscientific decision- making in AstraZeneca R&D is supported by or improved through the application of data science. Goal Strategy ObjectivesA scalable and enduring scientific data supply-chain is founded comprising both technology and services, through which data is made ‘analytics-ready’ accessible to users through a seamless ‘intent to insight’ workflow. Ø Build and operate platforms for hosting at least four key analytical data types, that make data ‘FAIR’. Ø Data interconnections support cross-domain exploration and analytics. Ø Tools and services to support data science workflows are created. Ø Data-use is compliant-by-default due to data governance wrappers. R&D data operations and IT platforms will be co-created between Data Science and AI R&D business units and Science IT to be operated as enduring capabilities with a focus on making data ‘FAIR’.
  • 26.
    01 SDF’s biggest tangiblevalue contribution will be to accelerate innovative science through direct enablement of data science workflows and programmes designed to introduce data-driven decision-making, Accelerate efforts in AI, data and digital 02 03 - SDF is a key enabler of AZ’s Growth Through Innovation Strategy SDF Strategic Drivers Data lies at the heart of scientific workflows. By democratizing data through SDF, we will change our culture to one that is more collaborative and truth-seeking, where decisions are data-driven and where we increasingly perform as an enterprise team. Advance our culture Through creation of an enduring data supply-chain, SDF will increase AZ’s agility to: take advantage of new data analysis methodologies and technologies; incorporate and drive value from new data sources; and actively govern and manage data in response to changing ethical and legal requirements. Build and adapt capabilities for the future
  • 27.
    Science Data Foundation(SDF) Goals A foundational programme to enable the Growth Through Innovation Strategy. Create an enduring supply-chain of data of various types and across the discovery and development pipeline that will drive scalable, and efficient data science operations. Generate analytics-ready data Create an efficient and seamless experience throughout the chain of activities scientists undertake to undertake data science. From planning projects, obtaining the data they need, performing analyses using powerful tools to finally applying new insights systematically and at scale, into R&D pipelines. Seamless intent to insight Introduce governing principles, supported by technology, to minimise risk of data misuse by ensuring compliance to internal and external policies. This shall allow scientists to focus on innovative science, guided through compliant ‘paved-paths’ to R&D data. Ensure compliance by default
  • 28.
    Relationships Between SDFGoals Data sources: Operational systems, other data platforms, instruments and external sources Analytics Ready Data Intent to insight The analytics-ready data goal will take data from sources, standardise, integrate and enrich the information. This is then supplied into the intent to insight data-usage process. The intent to insight process will create a seamless data usage process that provides a compliant by default path to data use and analytics. Compliance by defaultThe compliance by default goal will act to help define or update policies, assert that the policies adhere to external regulations and ensure that the intent to insight process applies
  • 29.
    SDF Programme Structure SDFLeadership SDF Change & Comms Sources Workflows SDF-Core Data Platform SDF- Data Policy & Governance SDF-Data Find & Integrate SDF-Data Science StorageIngestion Curation ExplorationAccess Analysis SDF Capability Enabling Workstream Cross-data-type SDF Workstream Cross data-type data- management, -quality and -usage policies defined. Scientific data management platforms setup (e.g., reference data management, data catalogue). Governing procedures that apply policies to SDF processes. Provide cross-cutting capabilities that enable all data-type workstreams to develop against a consistent, supported data foundation. SDF Analytical Data-Type Workstream Analytical data type workstreams (ADD, Patient, Omics, Imaging) will prepare and process data and meta-data for ingestion into the core data platform. In doing so: the data will become accessible according to standard policies and access mechanism alongside other data types; Standard patterns of exploration and data science will be enabled, although data-type workstreams are required to develop highly data-specific exploration and modes of analysis (e.g., genome browser and ‘omic variant analysis for SDF-Omics). SDF Workstreams SDF Data Workstreams: ADD; Patient; Omics; Imaging
  • 30.
    Goal: Generate AnalyticsReady Data Ingest and cataloguing Standardise and improve quality Curation and enrichment Data Hosting ü Reduce the time taken and effort by data scientists to assemble data into a single place. ü Reduce costs associated with lost innovation opportunities due to scientific data being unfindable, unusable or inaccessible to analytics toolsets. Data availability triggers automated flow into hosting environment Automated cataloguing of data on ingest is an enabler of findability Data ingestion can be templated to ensure new data sources have low barrier to also becoming hosted. Data can be ‘cleaned up’ by applying standardisation of key terms and identifiers. Data quality can be measured to help ultimate consumers plan their analyses. Enrichment of information and metadata can be both automated and expert-driven to create greater data reuseability and thus value. All data and metadata is hosted through an accessible environment so that information discovery and analytics tools can gain systematic (yet secure) access. Full track and trace and monitoring of a maximally automated process supports content reporting and auditability. Target SDF Capabilities An enduring supply-chain of data of various types and across the discovery and development pipeline that will drive scalable, and efficient data science operations. Key requirements are: • Data quality and completeness • Machine readable metadata • A hosting environment that can be support access by other systems Description Benefit Strategy
  • 31.
    Goal: Seamless Intentto insight Ideation using Information Discovery Register intent and make data request Data is provisioned Analysis and insights Application of insight ü Reduce time and effort to generate and administer intent to insight activities allows greater scale and lower cost to reuse data. ü Reduce wasted effort associated with scientifically flawed or non-compliant data reuse requests. ü Increase analytics capabilities to drive innovation ü Improve experience and job satisfaction Single point of entry to simplify and lower barriers to data reuse. Powerful and intuitive information discovery tools and connection to other experts to enable scientific ideation Intent, data and analysis requirements captured and issued electronically to ensure governance with minimal administration effort Data provided to an analysis team in the desired format and analysis environment. Data compliance and security are by default. Bespoke data products also supported Powerful analysis environments to support data science & AI workflows. Insights are captured and traceable to requests QA triggered for investment decisions and external publications. Insights with potential as BAU decision-support processes will trigger further creation of productionised data- analytics pipelines Full track and trace and monitoring of a maximally automated process supports audit and process improvement Target SDF Capabilities The chain of activities scientists undertake to plan data science projects, obtain the data they need, perform analyses and finally apply new insights systematically and at scale into R&D pipelines. Description Benefit Strategy
  • 32.
    Goal: Ensure complianceby default Ethical and legal frameworks Manage data standards Securing our data Training ü Reduce likelihood of fines associated to legal or ethical misuse of data. ü Reduce the burden on scientists to become compliance experts and allow them more time to focus on science, leading to increased innovation- based revenue generation. Frameworks that are built into systems and processes are fit for innovation purposes; balancing potentially changing restrictions that prevent misuse with enablement of data science. Information that supports ethical and legal data reusability is machine readable and can be efficiently managed by the Data Office. Host systems are secure from cyber attack and only allow users to perform operations such as data access, copy or movement without increasing risk. Target SDF Capabilities Governing principles, supported by technology, to minimise risk of data misuse by ensuring compliance to internal and external policies. This shall allow scientists to focus on innovative science, guided through compliant ‘paved-paths’ to R&D data. Description Provide training on processes and systems so that compliant paths to request and access data are known. Compliance monitoring Active compliance monitoring to provide early warning of risks associated to data reuse, helping to target training and remediation. Benefit Strategy
  • 33.
    Ideation & discovery Intent & Request Data Provisioning Analysis Application ofinsight As a scientist the Data Office provides me with a single point of entry to begin a process to exploit our data and the information and data exploration tools to drive my scientific creativity and ideation. I am able to find and request data online. The Data Office is on hand to advise me on issues of compliance and they also help to put me in touch with other experts. From the point of creating my request, I can follow the process easily. Whether you are an expert in AI, visual informatics, or have more scientific than IT expertise, Data Office will help you get your data to the right analysis environment, including cutting-edge cloud environments. Data office helps to ensure the right quality processes are triggered for investment decisions and external use, meaning that we customers can focus more on the science. When we’ve generated a promising new exploratory model that could be productionised to drive real value, Data Office will help us ‘productionise’ the data flow alongside our IT colleagues. Data office can get the data to you in a format you need and to a place where you can perform your analysis in a compliant and secure way. This ranges from systematic data flows to bespoke ‘data products’. ‘Intent to Insight’ – the process experienced by our customers. Data office provides a single point of entry for gaining access to data Expert support and maximal automation through the process ensures efficiency yet data compliance by default End of 2020 target
  • 34.
    Deliverables, benefits and next steps NewTarget Biology AI-driven Lead Optimisation Driving re-use of clinical data: 1000 studies in 2019, 1 million patients in 2020
  • 35.
    Erik Schultes, PhD InternationalScience Coordinator GO FAIR International Support and Coordination Office Leiden Center for Data Science erik.schultes@go-fair.org https://www.go-fair.org http://orcid.org/0000-0001-8888-635X FAIR Digtial Objects for Automating Processes 14 May, 2020
  • 36.
    Wilkinson, M. D.et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016). Automating F, A, I and R
  • 37.
  • 38.
    Paris, October 28-29,2019 RDA / GEDE: FAIR Digital Objects
  • 39.
  • 40.
    FAIR Digital Objects Basedon Bonino 2019 minimal open standard linking the FDO components ‘everything else’ TBA
  • 41.
    FAIR Digital Objects Basedon Bonino 2019 minimal open standard linking the FDO components ‘everything else’ TBA 1) GUPRI resolution service 2) Recursive FDO construction
  • 42.
    minimal open standard linkingthe FDO components ‘everything else’ TBA Machine- actionable atom-to- atom configuration FAIR Molecule a FAIR Digital Object for molecular structure
  • 43.
    A minimal standard*for a machine-actionable** representation of molecular structure*** that can be the basis of organizing other heterogeneous (meta)data****. * Easy to follow, encourages voluntary adoption ** FAIR *** Foundational concept in the chemistry domain *** Knowlet-like clusters of assertions about molecular structure FAIR Molecule a FAIR Digital Object for molecular structure
  • 44.
    Why FAIR Molecules? •Chemical view of the world is ubiquitous (example: biomedicine) • Chemical data is vast and complex • Rate of chemical data production is vast and growing • FAIR solutions are welcomed
  • 45.
    FAIR Molecule Hackathon 21& 22 January, 2020 Hamburg https://osf.io/ft6wn/
  • 46.
    Tuesday January 21 •13:00 Lunch • 14:00 Welcome / Overview (Erik) • 14:30 Participants Introductions - Rajaram / Kees / Luiz (FDO) - Yuliia / Alessa / Nicola (Molecular Structure) - Robert / Barbara / Stuart (Concpetual Models) - Hao / Folkert (DataBiology) - Myles / Erik (use cases) - John / Robert (Launch Pads) • 16:00 Break • 16:20 Task Organization / Discussion • 18:00 Pizza dinner • 19:00 Continue as desired • 22:00 ZBW doors close Hackathon Agenda
  • 47.
    Tuesday January 21 •13:00 Lunch • 14:00 Welcome / Overview (Erik) • 14:30 Participants Introductions - Rajaram / Kees / Luiz (FDO) - Yuliia / Alessa / Nicola (Molecular Structure) - Robert / Barbara / Stuart (Concpetual Models) - Hao / Folkert (DataBiology) - Myles / Erik (use cases) - John / Robert (Launch Pads) • 16:00 Break • 16:20 Task Organization / Discussion • 18:00 Pizza dinner • 19:00 Continue as desired • 22:00 ZBW doors close Hackathon Agenda
  • 48.
    Tuesday January 21 •13:00 Lunch • 14:00 Welcome / Overview (Erik) • 14:30 Participants Introductions - Rajaram / Kees / Luiz (FDO) - Yuliia / Alessa / Nicola (Molecular Structure) - Robert / Barbara / Stuart (Concpetual Models) - Hao / Folkert (DataBiology) - Myles / Erik (use cases) - John / Robert (Launch Pads) • 16:00 Break • 16:20 Task Organization / Discussion • 18:00 Pizza dinner • 19:00 Continue as desired • 22:00 ZBW doors close Hackathon Agenda
  • 49.
    Tuesday January 21 •13:00 Lunch • 14:00 Welcome / Overview (Erik) • 14:30 Participants Introductions - Rajaram / Kees / Luiz (FDO) - Yuliia / Alessa / Nicola (Molecular Structure) - Robert / Barbara / Stuart (Concpetual Models) - Hao / Folkert (DataBiology) - Myles / Erik (use cases) - John / Robert (Launch Pads) • 16:00 Break • 16:20 Task Organization / Discussion • 18:00 Pizza dinner • 19:00 Continue as desired • 22:00 ZBW doors close Hackathon Agenda
  • 50.
    Tuesday January 21 •13:00 Lunch • 14:00 Welcome / Overview (Erik) • 14:30 Participants Introductions - Rajaram / Kees / Luiz (FDO) - Yuliia / Alessa / Nicola (Molecular Structure) - Robert / Barbara / Stuart (Concpetual Models) - Hao / Folkert (DataBiology) - Myles / Erik (use cases) - John / Robert (Launch Pads) • 16:00 Break • 16:20 Task Organization / Discussion • 18:00 Pizza dinner • 19:00 Continue as desired • 22:00 ZBW doors close Hackathon Agenda
  • 51.
    Tuesday January 21 •13:00 Lunch • 14:00 Welcome / Overview (Erik) • 14:30 Participants Introductions - Rajaram / Kees / Luiz (FDO) - Yuliia / Alessa / Nicola (Molecular Structure) - Robert / Barbara / Stuart (Concpetual Models) - Hao / Folkert (DataBiology) - Myles / Erik (use cases) - John / Robert (Launch Pads) • 16:00 Break • 16:20 Task Organization / Discussion • 18:00 Pizza dinner • 19:00 Continue as desired • 22:00 ZBW doors close Hackathon Agenda
  • 52.
    Goal: Show FAIR interoperationbetween data & code Hackathon Agenda
  • 53.
    Resolves to GUPRI ePIC FAIR DigitalObject Record fdo:digitalObjectOfType fdo:MGFile ; fdo:locationOfDO <https://hackathon.fair-dtls.surf-hosted.nl/EL/> ; datacite:hasIdentifier :identifier ; dct:conformsTo <https://hackathon.fair-dtls.surf-hosted.nl/shacl-record.ttl> . fdof:hasResourceLocation Resource fdo:digitalObjectOfType Type MG File fdof:hasMetadata fdof:isMetadataOf Extensible Metadata # metadata section #<http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID702> ; # Ethanol #<http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID5280450> ; # Lineoleic acid #<http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID5282184> . # Ethyl Lineolate :elMetadata :respresents :molecule . :molecule :molecularWeight "308.47"^^:gramsPerMol ; skos:prefLabel "Ethyl Lineolate" ; skos:notation "C20H36O2" ; :cas "544-35-4" ; <http://semanticscience.org/resource/SIO_000212> <http://dx.doi.org/10.1002/ anie.201801332> ; # is referred to by :availableAt <https://www.sigmaaldrich.com/catalog/search? term=ethyl+linoleate&interface=All&N=0&mode=match%20partialmax&lang=en&regi on=US&focus=product> . # provenance :elMetadata dct:contributor orcid:0000-0002-8042-4131 . orcid:0000-0002-8042-4131 a foaf:Person ; foaf:name "Myles Axton" ; pro:holdsRoleInTime [ a pro:RoleInTime ; pro:withRole scoro:investigator-role ; ] .
  • 54.
    • GUPRI • FDORecord • Type - Molecular Graph • Extensible Metadata • Resource - molecular structure • GUPRI • FDO Record • Type - .mol • Extensible Metadata • Resource - molecular structure • GUPRI • FDO Record • Type - File conversion script • Extensible Metadata • Resource - Docker image FAIR Molecule 1 FAIR Molecule 2FDO for scripts FAIR Molecule Hackathon
  • 55.
    • GUPRI • FDORecord • Type - Molecular Graph • Extensible Metadata • Resource - molecular structure • GUPRI • FDO Record • Type - .mol • Extensible Metadata • Resource - molecular structure • GUPRI • FDO Record • Type - File conversion script • Extensible Metadata • Resource - Docker image FAIR Molecule 1 FAIR Molecule 2FDO for scripts FAIR Molecule Hackathon FDO Orchestration
  • 56.
    FAIR Molecule Established Knowledge- chemical informatics Real World Observations - lab automation Virtual World Observations - computer simulations chemify.org
  • 57.
    FAIR Molecules asDigital Twins https://www.manufacturingleadershipcouncil.com/2019/12/02/digital-twins/
  • 58.
    FAIR Molecules asDigital Twins
  • 59.
    FAIR Molecules asDigital Twins chemify.org
  • 60.
  • 61.
  • 62.
  • 63.
    Convergence Resource 1 Resource 2 Resource3 Resource 4 Resource 5 Resource 6 Resource 7 Resource 8 F A I R 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 1 0 1 0 1 1 1 Communities Resources FAIR Implementation Profiles Convergence Matrix http://www.data-intelligence-journal.org/p/47/ Reusing FIPs https://osf.io/8sv5f/
  • 64.
    Convergence • FIPs arereusable = drives convergence • FIPs guarantee interoperation • FIPs inform data stewardship plans FIPs are the DNA of the DMP Convergence Matrix http://www.data-intelligence-journal.org/p/47/ Reusing FIPs https://osf.io/8sv5f/
  • 65.
    Convergence Resource 1 Resource 2 Resource3 Resource 4 Resource 5 Resource 6 Resource 7 Resource 8 F A I R 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 1 0 1 0 1 1 1 Communities Resources Pharma Industry Challenge: Develop a common pre-competative FIP
  • 66.
  • 68.
    3 communities building FAIR distributed learning platforms: “FAIR Data Trains”. Barbra Magagna, Umweltbundesamt GmbH; Kristina Hettne, CDS University Library
  • 69.
    3 communities building FAIR distributed learning platforms: “FAIR Data Trains”. Barbra Magagna, Umweltbundesamt GmbH; Kristina Hettne, CDS University Library. Choice; Challenge
  • 70.
    3 communities building FAIR distributed learning platforms: “FAIR Data Trains”. Barbra Magagna, Umweltbundesamt GmbH; Kristina Hettne, CDS University Library. Choice; Challenge; A F R I
  • 71.
    FAIR automation and FAIR applications, May 2020, Georges Heiter
  • 72.
    Copyright ©2020. All Rights Reserved. Confidential Databiology Ltd. AUTOMATION
  • 73.
    Humans are manually involved in every step of the research process. The bulk of the effort is still spent on finding and preparing data for analysis. Metadata about digital assets and the operations upon them is mostly not captured → research is not easily repeatable or automatable
  • 74.
    Data Usage Challenge (making data actionable)
    Diagram: Data Source Model 1, 2, ... n; Knowledge Network; Analysis Data Model 1, 2, ... n
    Data Source Model: Domain - narrow scope; Scope - specialized; Flexibility - use case specific and/or proprietary
    Knowledge Network: Domain - multi-domain; Scope - broad & standardized; Flexibility - growing/changing
    Analysis Data Model: Domain - cross-domain; Scope - specialized; Flexibility - use case specific and/or proprietary
  • 75.
    Machine Actionable Components as Foundation of the Knowledge Network
  • 76.
    What makes an application FAIR?
    ▪ Findable with unique ID, and digitally signed
    ▪ Accessible in an associated permanent registry
    ▪ Interoperable because they rely on standards
    ▪ Reusable as self-contained and fully portable
    ▪ Software integrity and quality
    ▪ Customizable
  • 77.
    CIAO App – software packaged with metadata
    CIAO app: Code, Aux, Metadata
    App metadata: − Name − Version − Author − Description − Inputs − Outputs − Parameters − License − Original source − Reference data dependencies
    Apps are stored and distributed in a repository with a unique id: https://hub.databiology.net/app-dbio-blast/tags/2.9.1
    docker pull hub.databiology.net/app-dbio/blast:2.9.1
    Metadata is integrated in the container
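    The metadata fields listed on this slide can be sketched as a typed record that travels with the app. This is an illustration only: the class `AppMetadata`, its methods and all field values other than the name `blast` and version `2.9.1` are placeholders, and the slide does not say how Databiology actually embeds the metadata in the container.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AppMetadata:
    """Hypothetical record mirroring the app metadata fields on the slide."""
    name: str
    version: str
    author: str
    description: str
    license: str
    original_source: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    parameters: List[str] = field(default_factory=list)
    reference_data_dependencies: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of text fields that were left empty."""
        return [key for key, value in asdict(self).items()
                if isinstance(value, str) and not value.strip()]

    def image_reference(self, registry: str, namespace: str) -> str:
        """Build a registry/namespace/name:version image reference."""
        return f"{registry}/{namespace}/{self.name}:{self.version}"


# Placeholder values, apart from the name and version shown on the slide.
blast = AppMetadata(
    name="blast",
    version="2.9.1",
    author="placeholder author",
    description="placeholder description",
    license="placeholder license",
    original_source="placeholder source URL",
    inputs=["query sequences"],
    outputs=["alignment report"],
    parameters=["e-value threshold"],
    reference_data_dependencies=["sequence database"],
)

serialized = json.dumps(asdict(blast), indent=2)  # form that could ship with the image
print(blast.image_reference("hub.databiology.net", "app-dbio"))
```

    Keeping name and version inside the record means the identifier used to pull the app can be derived from the metadata rather than maintained separately.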
  • 78.
    CIAO app instance: links the app to an infrastructure and organizational context (diagram labels: Sets, Defines)
  • 79.
    CIAO App run: app instance execution record
    The workunit will keep a record of: − App instance used − App status − Inputs and outputs − Execution versions − Execution logs − Infrastructure used − Data provenance
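    A minimal sketch of what such an execution record could look like in code, assuming a simple in-memory object. The class `WorkUnit`, its field names, the status strings and the example values are hypothetical and are not Databiology's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class WorkUnit:
    """Hypothetical execution record for one run of an app instance."""
    app_instance: str
    inputs: List[str]
    parameters: Dict[str, str]
    versions: Dict[str, str]      # e.g. app and reference data versions
    infrastructure: str
    status: str = "pending"
    outputs: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)

    def log(self, message: str) -> None:
        # Prefix each entry with a UTC timestamp.
        stamp = datetime.now(timezone.utc).isoformat()
        self.logs.append(f"{stamp} {message}")

    def start(self) -> None:
        self.status = "running"
        self.log("started")

    def complete(self, outputs: List[str]) -> None:
        self.outputs = list(outputs)
        self.status = "completed"
        self.log("completed")

    def provenance(self) -> Dict[str, List[str]]:
        """Map every output to the inputs it was derived from."""
        return {out: list(self.inputs) for out in self.outputs}


run = WorkUnit(
    app_instance="blast-instance-1",           # placeholder identifier
    inputs=["query.fasta"],
    parameters={"evalue": "1e-5"},
    versions={"app": "2.9.1"},
    infrastructure="example-compute-site",     # placeholder site name
)
run.start()
run.complete(["hits.tsv"])
print(run.status, run.provenance())
```

    Because inputs, versions, parameters and infrastructure are captured at run time, the same record can later be replayed or audited without asking the researcher what was done.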
  • 80.
    CIAO apps evolution – progressive layering of metadata to make apps FAIR
    CIAO app: Code, Aux Data, Metadata
    + CIAO app instance: Storages, Compute, Security, Policies
    + CIAO app run (Workunit): Inputs, Outputs, Parameters, Policies, Logs, Versions
  • 81.
    Machine actionable policies and secrets
    Policy based Consent Management
    ▪ Policies make use of metadata
      − Example: Define consent tags on studies, datasets and entities
      − Key operations on data, applications and infrastructure subject to policy
      − Granularity vs scalability
    ▪ Stand-alone Policy service
      − System landscape enforces policies managed in policy service
      − OPA (https://www.openpolicyagent.org/)
    ▪ Stand-alone secret management
      − Facilitation of security workflows
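    The idea of consent tags driving policy decisions can be illustrated with a small deny-by-default check. This is plain Python, not OPA or its Rego policy language, and the tag names, operations and the `POLICY` table are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Dataset:
    name: str
    consent_tags: FrozenSet[str]   # tags attached as metadata to the dataset


@dataclass(frozen=True)
class Request:
    user: str
    operation: str                 # e.g. "analyze" or "export"


# Hypothetical policy: consent tags a dataset must carry for each operation.
POLICY: Dict[str, FrozenSet[str]] = {
    "analyze": frozenset({"research-use"}),
    "export": frozenset({"research-use", "sharing-allowed"}),
}


def is_allowed(request: Request, dataset: Dataset) -> bool:
    """Deny by default; allow only if all required consent tags are present."""
    required = POLICY.get(request.operation)
    if required is None:
        return False               # unknown operations are never allowed
    return required <= dataset.consent_tags


cohort = Dataset("example-cohort", frozenset({"research-use"}))
print(is_allowed(Request("alice", "analyze"), cohort))
print(is_allowed(Request("alice", "export"), cohort))
```

    In a stand-alone policy service the `POLICY` table would live outside the calling systems, which only send the request and the dataset's tags and enforce the returned decision.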
  • 82.
    Composability
  • 83.
    Databiology Approach – Intelligent Automation powered by a Knowledge Network that converges Data, Applications, Infrastructure and Organizations
    Diagram labels: Knowledge Network; Intelligent Automation; Knowledge Engine; Orchestration Engine; DATA (Source 1, Source 2, Source n); APPLICATIONS (App 1, App 2, App n); INFRASTRUCTURE; Compute Site; Application; Data; PEOPLE, ORGS & POLICIES
  • 84.
    Knowledge Engine
    Data Modeling
    ▪ Entity Definition / MDS
    ▪ Terminology Service (Ontologies)
    ▪ Policy Service
    Data Discovery
    ▪ Search
    ▪ Ontology Mapping
    ▪ Collection Management
    ▪ Federated Search
    Data Ingestion
    ▪ Aggregation
      − Multi-Channel (Batch / Stream)
      − Enrichment
      − Validation
      − Origination (Lineage / Provenance)
      − Persistence
    ▪ Federation
      − Federated Data Sources
      − Origination (Lineage / Provenance)
      − Caching
  • 85.
    Orchestration Engine
    ▪ Secret Management
      − Secure Credential Store
      − Security workflows to provide secrets to orchestration processes
    ▪ Workunit Management
      − Inspection (Real-time monitoring)
      − Monitoring & Logging
      − State control (Real-time)
    ▪ Compute Capacity Management
      − Provisioning / Deprovisioning Dynamic Capacity (VMs)
      − Cloud Providers
      − On-premise technologies
    ▪ Data Orchestration
      − Data Transport (unstructured)
        o Transfer Protocols
      − Data Projections (entity data)
        o Covers Ingress and Egress entity data
    ▪ Application Orchestration
      − Application Registry
      − Application Transport
      − Dynamic Proxying of Interactive Apps to user browser
  • 86.
    Example: Contextually aware research assistant delivers intelligent automation
    ▪ Automatically routes to and executes analysis on the most suitable infrastructure
    ▪ Automatically extracts insights and feeds them back into the knowledge graph
    ▪ Suggests analysis apps based on contextual data, including the user’s data selection and their previous analysis runs
  • 87.
    Automation will free researchers to focus on higher level tasks. Let machines take over manual, labor-intensive functions so that researchers can focus on ideation and creativity, for LOWER COST PER INSIGHT
  • 88.
    Intelligent Automation Value
    Increase researcher effectiveness and efficiency
    • Discover more across a federated knowledge network and collaborate securely
    • Automation and AI allow researchers to focus on the science instead of the IT
    • Always use best in class analytics tools to get the most insights out of data
    Mitigate risk
    • Automatic audit trail, provenance and reproducibility
    • Future-proof due to no technology stack lock-ins (composability side effect)
    • Lasting data integration and interoperability (knowledge network)
    Lower the cost per insight
    • Achieve higher levels of automation -> contextually aware assistance
    • Eliminate duplication of effort
    • Speed to value measured in weeks not years
  • 89.
    Do You Have Any Questions?
  • 90.
  • 91.
    Prepared Questions
    1. What are the different flavours of FAIR implementation and application for the Life Science industry?
    2. What are the low-hanging fruit and the likely challenges of FAIR implementation and application by the Life Science industry?
    3. What would a common specification for FAIR digital objects look like? Why is this important for the Life Science industry?
    Questions from the audience
  • 92.
    Get Involved! Join the FAIR Implementation project. Ian Harrow: Ian.harrow@pistoiaalliance.org. Membership: membership@pistoiaalliance.org. General Enquiries: Zahid Tharia – zahid.tharia@pistoiaalliance.org. www.pistoiaalliance.org
  • 93.
    Next Webinar: Lab of the Future, Thurs, May 21st, 2020, 14:30 – 16:00 BST www.pistoiaalliance.org/webinars-2020/
  • 94.