Using the FREYA PID Graph to help reproduce scientific research

Using PID Graph to reproduce
research
Marcus Povey and Claudia Alen Amaro
FREYA Wrap up meeting, Amsterdam
November 2020

Instruct-ERIC is the single point of access to technology and expertise for
structural biology research.
The Instruct consortium comprises ten Instruct Centres that offer
access to 23 research sites across Europe.
Instruct has 15 Members that each pay an annual subscription to allow their scientists to access the
range of services that are available through Instruct.
1. Instruct Centre BE
2. Instruct Centre CZ
3. Instruct Centre ES
4. Instruct Centre FI
5. Instruct Centre FR1
6. Instruct Centre FR2
7. Instruct Centre IL
8. Instruct Centre IT
9. Instruct Centre NL
10. Instruct Centre UK
Belgium
Italy
Denmark
Netherlands
Czech Republic
Latvia
Finland
Slovakia
France
Spain
Israel
UK
Portugal
EMBL
Lithuania

Technology Catalogue
Instruct’s extensive service catalogue encourages a multidisciplinary
approach to structural biology research.
Instruct offers access to over 75 different services, from sample
preparation to biomolecular and 3D structural analyses.

Instruct is an ESFRI landmark and an ERIC
We are working together with other Life
science research infrastructures in the
project EOSC-Life.
We have been collaborating with FREYA and
OpenAIRE, preparing to engage with EOSC
and to make sure that our data is FAIR
We have developed our own proposal
management system: ARIA

Our Mission
• ACCESS – Facilitating access to cutting
edge research infrastructure and
methods
• FACILITY – Helping research
infrastructures manage their equipment,
and representing their interests
• COMMUNITY – Contributing to the wider
scientific community as a whole, and
helping researchers, projects and
infrastructures work better together
• DATA – Improving access to research data,
and facilitating Open Access
• ARIA Cloud!
DATA
FACILITY
ACCESS
COMMUNITY

Structural Biologists
use Microscopes
• 12 Samples (grids) in a loader
• Each grid can potentially have multiple
structures that are of interest (projects
with 96 well grids are underway)
• Outputs ~1-3TB of HD Video per day

Electron Microscopy
1. Researcher submits a proposal for access
2. Researcher produces a sample locally
3. Sample is loaded onto a grid
4. Grid goes into Electron Microscope
5. Micrographs go into pre-processor
6. Particle picking, auto & manual processing
7. Datasets are analysed by 10s of software packages
8. 3D structure determined
9. Structure deposited into PDB/EMDB
10. Publication in journal

There are a lot of
things we might want
to track…
• Number of sample grids
• Potentially multiplied by samples on
a grid
• Multiplied by grids in a microscope
• Multiplied by frames of video
• Multiplied by number of microscopes per
facility
• … multiplied by the number of facilities.
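
To give a feel for how quickly these factors compound, here is a minimal sketch of the multiplication. Only the figure of 12 grids per loader comes from the slides; every other number, and the name `identifiers_needed`, is a made-up placeholder for illustration.

```python
import math


def identifiers_needed(factors):
    """Multiply the per-level counts together to get a rough upper bound."""
    return math.prod(factors.values())


# Only grids_per_loader (12) comes from the slides; every other number is a
# hypothetical placeholder chosen purely to show the multiplication.
example_factors = {
    "grids_per_loader": 12,
    "structures_per_grid": 3,
    "frames_per_structure": 1000,
    "microscopes_per_facility": 2,
    "facilities": 5,
}

estimate = identifiers_needed(example_factors)
print(f"Rough upper bound on trackable items: {estimate:,}")
```

Even with modest placeholder values the count grows fast, which is why deciding what is "relevant" enough to receive its own PID matters.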

... But wait, there’s
more!
• We need to know the data processing
workflows used
• We need to identify samples and
associated metadata
• We need to know a given machine’s
configuration
• Software and software versions used to
process and analyse data
• Researchers involved in project
• Funding applications (proposals)

Structural Biologists
also use
Synchrotrons…
• Similarly large data volumes
• Similarly complex machine configurations
• Similarly complex data processing
workflow to produce results

Crystallography
1. Researcher submits a proposal for access
2. Researcher produces a sample locally
3. Sample added to crystal plate
4. Crystal plate imaged regularly
5. Crystals loaded onto pins
6. Crystals shot with X-rays at synchrotron
7. Diffraction pattern auto-analysis and re-running
8. 3D structure determined
9. Structure deposited into PDB
10. Publication in journal

How can we help
make the research
reproducible?
• There are a lot of parts to keep track of!
• Data sets are often too large to practically
move about
• Machine configurations are often only
available on the machine itself
• Software gets modified
• How do we make this findable, accessible
and reusable?
• Can the PID graph help?

Building an
experimental session
• At the beginning of an experimental
session, mint a PID.
• It is also a good idea to reference the
institution in which this takes place, if
available
• Mint a PID for all relevant assets and data
as session progresses
• (“Relevant” is highly application-specific)
• Each asset PID “cites” the experimental
identifier
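
The session-building steps above could be sketched roughly as follows. This is an illustration only: `mint_pid` is a hypothetical stand-in for a real minting service, the `example-pid:` scheme and the institution identifier are placeholders, and the `relationType` values (`References`, `Cites`) are borrowed from the DataCite vocabulary as an assumption about how the links might be expressed.

```python
import itertools
from dataclasses import dataclass, field

_counter = itertools.count(1)


def mint_pid():
    """Hypothetical minting call: returns a fake, locally unique identifier."""
    return f"example-pid:{next(_counter):04d}"


@dataclass
class Record:
    pid: str
    resource_type: str
    related_identifiers: list = field(default_factory=list)

    def relate(self, relation_type, target_pid):
        """Attach a typed link from this record to another PID."""
        self.related_identifiers.append(
            {"relationType": relation_type, "relatedIdentifier": target_pid}
        )


def start_session(institution_pid=None):
    """Mint a PID for the session and reference the institution if known."""
    session = Record(mint_pid(), "ExperimentalSession")
    if institution_pid:
        session.relate("References", institution_pid)
    return session


def add_asset(session, resource_type):
    """Mint a PID for an asset; the asset "cites" the session identifier."""
    asset = Record(mint_pid(), resource_type)
    asset.relate("Cites", session.pid)
    return asset


session = start_session(institution_pid="example-institution-pid")
micrographs = add_asset(session, "Dataset")
machine_config = add_asset(session, "MachineConfiguration")
```

Because every asset points back at the session, the session record itself never needs updating as more assets are minted.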

Building a "research
bundle"
• Once the output is produced a research
identifier is minted
• Which links one or more experimental
sessions
• A link is established between the
identifier and the output
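
A research bundle record might then look something like the sketch below. The field names and the choice of `HasPart` and `IsSupplementTo` as relation types are assumptions made for illustration; the slides only say that the bundle links sessions and is linked to the output.

```python
def build_bundle(bundle_pid, session_pids, output_pid):
    """Return a bundle record linking experimental sessions to an output."""
    related = [
        {"relationType": "HasPart", "relatedIdentifier": session_pid}
        for session_pid in session_pids
    ]
    # Assumed relation: the bundle is supplementary material for the output.
    related.append(
        {"relationType": "IsSupplementTo", "relatedIdentifier": output_pid}
    )
    return {
        "id": bundle_pid,
        "resourceType": "ResearchBundle",
        "relatedIdentifiers": related,
    }


# All identifiers here are placeholders.
bundle = build_bundle(
    "example-pid:bundle-1",
    ["example-pid:session-1", "example-pid:session-2"],
    "example-pid:paper-1",
)
```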

Using a “Research
bundle”
• Interrogate the published output
• Use the PID graph to find associated
datasets

Using a “Research
bundle”
• Next, find one or more experimental
sessions
• Interrogate relatedIdentifiers to
find referenced Dataset nodes

Producing a “Research
bundle”
• Finally, expand the experimental session
• Use the graph in a similar method to User
Story 8 (Fenner, 2020)
• For each data set, collect its parts.
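
The three lookup steps (output to bundle, bundle to sessions, sessions to datasets and their parts) can be sketched against a tiny in-memory graph. This does not call the real PID Graph API; the `GRAPH` structure, the identifiers and the relation names are all assumptions chosen to keep the example self-contained.

```python
# Each node lists its outgoing (relation, target) links. Placeholder data.
GRAPH = {
    "paper-1": {"type": "Publication", "related": []},
    "bundle-1": {
        "type": "ResearchBundle",
        "related": [("IsSupplementTo", "paper-1"), ("HasPart", "session-1")],
    },
    "session-1": {"type": "ExperimentalSession", "related": []},
    "dataset-1": {
        "type": "Dataset",
        "related": [
            ("Cites", "session-1"),
            ("HasPart", "frames-1"),
            ("HasPart", "frames-2"),
        ],
    },
    "frames-1": {"type": "Dataset", "related": [("IsPartOf", "dataset-1")]},
    "frames-2": {"type": "Dataset", "related": [("IsPartOf", "dataset-1")]},
    "config-1": {
        "type": "MachineConfiguration",
        "related": [("Cites", "session-1")],
    },
}


def incoming(graph, relation, target):
    """PIDs of nodes that point at `target` with the given relation."""
    return sorted(
        pid for pid, node in graph.items() if (relation, target) in node["related"]
    )


def outgoing(graph, pid, relation):
    """Targets that `pid` points at with the given relation."""
    return [target for rel, target in graph[pid]["related"] if rel == relation]


def expand_output(graph, output_pid):
    """Walk output -> bundles -> sessions -> assets -> parts."""
    result = {}
    for bundle in incoming(graph, "IsSupplementTo", output_pid):
        for session in outgoing(graph, bundle, "HasPart"):
            assets = incoming(graph, "Cites", session)
            result[session] = {
                asset: outgoing(graph, asset, "HasPart") for asset in assets
            }
    return result


expanded = expand_output(GRAPH, "paper-1")
```

In a real deployment each helper would be a query against the graph service rather than a scan over a dictionary.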

What a tool might look like…
• There are three main tasks
• Minting PIDs for the assets
• Linking those PIDs together to produce a bundle
• Retrieving the bundle and presenting them in a structured way
• All this needs to be wrapped into one or more services to be useful

Experiment session
manager
• A tool primarily used by infrastructures to
produce the experimental session and
research identifier
• Service provides APIs to mint identifiers
for assets
• Service manages associating assets with
the session record.

Experiment session manager
• Researcher makes a booking / drops in to use a machine
• They scan a QR code and are taken to their booking on a website
• They log in using their ORCID and click “Begin Session”
• PID is minted identifying session
• System adds machine PID
• System adds PIDs for the booking, machine configuration, etc.
• During the session other outputs can be added
• Facility staff can add additional information (processed data, samples,
etc.)
• ARIA could be extended to do this!
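
As a rough sketch of the flow above, a session manager could be modelled like this. It is not ARIA code: `SessionManager`, `make_fake_minter`, the ORCID iD, the booking reference and the `example-pid:` identifiers are all hypothetical, and authentication is reduced to passing an ORCID string.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    pid: str
    researcher_orcid: str
    linked_pids: list = field(default_factory=list)


def make_fake_minter():
    """Hypothetical minter: returns predictable placeholder identifiers."""
    counter = 0

    def mint(kind):
        nonlocal counter
        counter += 1
        return f"example-pid:{kind}-{counter}"

    return mint


class SessionManager:
    def __init__(self, mint):
        self._mint = mint
        self.sessions = {}

    def begin_session(self, orcid, booking_id, machine_pid):
        """Called when the logged-in researcher clicks "Begin Session"."""
        session = Session(self._mint("session"), orcid)
        session.linked_pids.append(machine_pid)
        session.linked_pids.append(self._mint(f"booking-{booking_id}"))
        session.linked_pids.append(self._mint("machine-configuration"))
        self.sessions[session.pid] = session
        return session

    def add_output(self, session_pid, kind):
        """Mint a PID for a further output and attach it to the session."""
        pid = self._mint(kind)
        self.sessions[session_pid].linked_pids.append(pid)
        return pid


manager = SessionManager(make_fake_minter())
session = manager.begin_session(
    orcid="0000-0000-0000-0000",  # placeholder, not a real ORCID iD
    booking_id="B42",
    machine_pid="example-pid:microscope-1",
)
processed = manager.add_output(session.pid, "processed-data")
```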

The Claiming tool
• When the output has been produced, we
need to link this to all the outputs and
create a “bundle”
• For ARIA users, this would be as simple as
adding the PID of the output to their
proposal
• This is an often-requested feature!
• For others, we need the help of the PID
graph

The Claiming tool
• Researcher logs in to a service with their ORCID
• Enter the PID of their output
• A search is performed identifying datasets produced by the author,
allowing selection to be added to a bundle
• Perhaps further optimised by excluding those not “part of” an existing dataset
• They “confirm” the process, a new bundle is created and a PID
minted.
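
The search-and-confirm step could be sketched as below. The record layout, the function names and all identifiers are assumptions; the `only_parts` flag follows the slide's wording literally by dropping datasets that are not "part of" an existing dataset.

```python
def candidate_datasets(records, orcid, only_parts=False):
    """List dataset PIDs created by the given author."""
    candidates = []
    for record in records:
        if record["type"] != "Dataset" or orcid not in record["creators"]:
            continue
        # Optional narrowing: keep only datasets that are part of another one.
        if only_parts and not record.get("isPartOf"):
            continue
        candidates.append(record["id"])
    return candidates


def create_bundle(records, orcid, output_pid, selected, new_pid):
    """Confirm the selection and return a new bundle record."""
    allowed = set(candidate_datasets(records, orcid))
    rejected = [pid for pid in selected if pid not in allowed]
    if rejected:
        raise ValueError(f"not claimable by {orcid}: {rejected}")
    return {
        "id": new_pid,
        "output": output_pid,
        "datasets": list(selected),
        "claimedBy": orcid,
    }


# Placeholder records standing in for PID graph search results.
RECORDS = [
    {"id": "example-pid:dataset-1", "type": "Dataset",
     "creators": ["0000-0000-0000-0001"], "isPartOf": None},
    {"id": "example-pid:frames-1", "type": "Dataset",
     "creators": ["0000-0000-0000-0001"], "isPartOf": "example-pid:dataset-1"},
    {"id": "example-pid:dataset-2", "type": "Dataset",
     "creators": ["0000-0000-0000-0002"], "isPartOf": None},
    {"id": "example-pid:paper-1", "type": "Publication",
     "creators": ["0000-0000-0000-0001"], "isPartOf": None},
]

me = "0000-0000-0000-0001"  # placeholder ORCID iD
all_mine = candidate_datasets(RECORDS, me)
parts_only = candidate_datasets(RECORDS, me, only_parts=True)
bundle = create_bundle(
    RECORDS, me, "example-pid:paper-1", all_mine, "example-pid:bundle-1"
)
```

Validating the selection against the author's own candidates is one possible safeguard against claiming someone else's data; a real service would need a proper review step.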

Research bundle
viewer
• Brings this all together in a simple site
• Landing page where a visitor could enter
the PID
• Performs searches and provides an
interface to drill down into the relevant
research bundles.
• … in a usable way
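
The drill-down view could start as something as simple as an indented outline. The nested `bundle_tree` structure below is a hypothetical shape for what the viewer's search might return, with placeholder identifiers throughout.

```python
def render(node, depth=0):
    """Return indented text lines for a node and all of its children."""
    lines = [f"{'  ' * depth}{node['type']}: {node['pid']}"]
    for child in node.get("children", []):
        lines.extend(render(child, depth + 1))
    return lines


# Placeholder data: one bundle, one session, one dataset with one part.
bundle_tree = {
    "type": "ResearchBundle",
    "pid": "example-pid:bundle-1",
    "children": [
        {
            "type": "ExperimentalSession",
            "pid": "example-pid:session-1",
            "children": [
                {
                    "type": "Dataset",
                    "pid": "example-pid:dataset-1",
                    "children": [
                        {"type": "Dataset", "pid": "example-pid:frames-1"}
                    ],
                },
                {"type": "MachineConfiguration", "pid": "example-pid:config-1"},
            ],
        }
    ],
}

outline = render(bundle_tree)
print("\n".join(outline))
```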

Next steps…
• Still very much in the concept stage
• But much of this functionality has been
requested by our users
• We have a commitment to better help
link their data…
• Watch this space!

@ARIA_access
aria@instruct-eric.eu
Thanks!