Instruct-ERIC facilitates access to cutting-edge research infrastructure within the domain of structural biology. Through Instruct, researchers can gain access to the techniques, services and equipment they need to perform their research.
The operation of some of this equipment is, to say the least, complex, and any given piece of research output has a large number of inputs that contribute to it. Many of these inputs are critical for someone who wants to reproduce that research, so we need to make them findable. In addition, by identifying and documenting every step in the process, from proposal to publication, it should be possible to reproduce and optimise these preparation techniques, which could also reveal commonalities in the process.
Inputs for a given research output may include, but are not limited to: the sample (origin, characteristics, purity, descriptors); its preparation (reagents, formulation, modifications); the machine(s) used; the configuration of the machine(s) at the time; one or more areas of interest in multiple frames of multiple terabytes of HD video; the software, software version, algorithm and parameters used to perform any data processing; the researchers involved; and more.
An additional complexity is that some of this information may need to remain in situ, either because it’s impractical to move due to the volume of data involved, or because of contractual or ethical confidentiality reasons.
Using the PID Graph (Fenner & Aryani, 2019) would give us the opportunity to construct, on demand, a “research bundle”. This would tell any future researcher, in detail, what inputs went into producing any given research output and where that data resides. It would allow our research infrastructures to conform to the FAIR principles, as well as meet any requisite data protection obligations.
<a href="https://doi.org/10.5281/zenodo.4275872"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.4275872.svg" alt="DOI"></a>
<a href="https://doi.org/10.5281/zenodo.4277945"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.4277945.svg" alt="DOI"></a>
Using the FREYA PID Graph to help reproduce scientific research
1. Using PID Graph to reproduce
research
Marcus Povey and Claudia Alen Amaro
FREYA Wrap up meeting, Amsterdam
November 2020
2. Instruct-ERIC is the single point of access to technology and expertise for
structural biology research.
The Instruct consortium comprises ten Instruct Centres that offer
access to 23 research sites across Europe.
Instruct has 15 Members that each pay an annual subscription to allow their scientists to access the
range of services that are available through Instruct.
1. Instruct Centre BE
2. Instruct Centre CZ
3. Instruct Centre ES
4. Instruct Centre FI
5. Instruct Centre FR1
6. Instruct Centre FR2
7. Instruct Centre IL
8. Instruct Centre IT
9. Instruct Centre NL
10. Instruct Centre UK
Belgium
Italy
Denmark
Netherlands
Czech Republic
Latvia
Finland
Slovakia
France
Spain
Israel
UK
Portugal
EMBL
Lithuania
3. Instruct’s extensive
service catalogue
encourages a
multidisciplinary
approach to structural
biology research.
Technology
Catalogue
Instruct offers access to
over 75 different
services, from sample
preparation to
biomolecular and 3D
structural analyses.
4. Instruct is an ESFRI landmark and an ERIC
We are working together with other life science research infrastructures in the project EOSC-Life.
We have been collaborating with FREYA and OpenAIRE, preparing to engage with EOSC and making sure that our data is FAIR.
We have developed our own proposal
management system: ARIA
5. Our Mission
• ACCESS – Facilitating access to cutting-edge research infrastructure and methods
• FACILITY – Helping research
infrastructures manage their equipment,
and representing their interests
• COMMUNITY – Contributing to the wider
scientific community as a whole, and
helping researchers, projects and
infrastructures work better together
• DATA – Improving access to research data,
and facilitating Open Access
• ARIA Cloud!
6. Structural Biologists
use Microscopes
• 12 Samples (grids) in a loader
• Each grid can potentially have multiple
structures that are of interest (projects
with 96 well grids are underway)
• Outputs ~1-3TB of HD Video per day
7. Electron Microscopy
Researcher submits a
proposal for access
Researcher produces a
sample locally
Sample is loaded
onto a grid
Grid goes into Electron
Microscope
Micrographs go into
pre-processor
Particle picking, auto &
manual processing
Datasets are analysed by tens of software packages
3D structure
determined
Structure deposited
into PDB/EM-DB
Researcher submits a
publication to journal
8. There are a lot of
things we might want
to track…
• Number of sample grids
• Potentially multiplied by samples on
a grid
• Multiplied by grids in a microscope
• Multiplied by frames of video
• Multiplied by number of microscopes per
facility
• … multiplied by the number of facilities.
9. ... But wait, there’s
more!
• We need to know the data processing
workflows used
• We need to identify samples and
associated metadata
• We need to know a given machine’s
configuration
• Software and software versions used to
process and analyse data
• Researchers involved in project
• Funding applications (proposals)
11. Crystallography
Researcher submits a
proposal for access
Researcher produces a
sample locally
Sample added to
crystal plate
Crystal plate imaged
regularly
Crystals loaded onto
pins
Crystals shot with X-rays at synchrotron
Diffraction pattern auto-analysis and re-running
3D structure
determined
Structure deposited
into PDB
Researcher submits a
publication to journal
12. How can we help
make the research
reproducible?
• There are a lot of parts to keep track of!
• Data sets are often too large to practically
move about
• Machine configurations are often only
available on the machine itself
• Software gets modified
• How do we make this findable, accessible
and reusable?
• Can the PID graph help?
14. Building an
experimental session
• At the beginning of an experimental
session, mint a PID.
• It is also a good idea to reference the
institution in which this takes place, if
available
• Mint a PID for all relevant assets and data
as session progresses
• (“Relevant” is highly application-specific)
• Each asset PID “cites” the experimental
identifier
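The session-building steps above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `mint_pid`, `Session`, `Asset` and `begin_session` are hypothetical names, the `example:` identifiers are placeholders, and a real service would request identifiers from a PID provider rather than generate them locally.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_counter = count(1)


def mint_pid(kind: str) -> str:
    """Hypothetical stand-in for a PID provider: returns a locally unique id."""
    return f"example:{kind}/{next(_counter):04d}"


@dataclass
class Asset:
    pid: str
    description: str
    cites: str  # PID of the experimental session this asset belongs to


@dataclass
class Session:
    pid: str
    institution: Optional[str] = None  # institution identifier, if available
    assets: list = field(default_factory=list)

    def add_asset(self, kind: str, description: str) -> Asset:
        # Every asset gets its own PID and "cites" the session identifier.
        asset = Asset(pid=mint_pid(kind), description=description, cites=self.pid)
        self.assets.append(asset)
        return asset


def begin_session(institution: Optional[str] = None) -> Session:
    # A PID is minted at the very start of the experimental session.
    return Session(pid=mint_pid("session"), institution=institution)


session = begin_session(institution="example:institution/demo")
grid = session.add_asset("sample", "Sample grid loaded into the microscope")
config = session.add_asset("config", "Machine configuration at session start")
```

Which assets count as “relevant” is left to the caller here, since that decision is application-specific.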
15. Building a "research
bundle"
• Once the output is produced a research
identifier is minted
• Which links one or more experimental
sessions
• A link is established between the
identifier and the output
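One possible shape for the bundle record is a DataCite-style metadata fragment whose `relatedIdentifiers` point at the sessions and at the output. The sketch below assumes that all identifiers are DOIs, that sessions are linked with `HasPart`, and that the output is linked with `IsSupplementTo`; the relation types are real DataCite values, but choosing them for this purpose is an assumption, and `build_bundle` and the `10.1234/...` identifiers are hypothetical.

```python
def build_bundle(bundle_pid, output_pid, session_pids):
    """Return an illustrative DataCite-style record for a research bundle."""
    if not session_pids:
        raise ValueError("a bundle must link at least one experimental session")

    # One link per experimental session that contributed to the output.
    related = [
        {
            "relatedIdentifier": pid,
            "relatedIdentifierType": "DOI",
            "relationType": "HasPart",
        }
        for pid in session_pids
    ]
    # The link between the research identifier and the published output.
    related.append(
        {
            "relatedIdentifier": output_pid,
            "relatedIdentifierType": "DOI",
            "relationType": "IsSupplementTo",
        }
    )
    return {"id": bundle_pid, "relatedIdentifiers": related}


bundle = build_bundle(
    bundle_pid="10.1234/bundle-1",      # placeholder identifiers
    output_pid="10.1234/paper-1",
    session_pids=["10.1234/session-a", "10.1234/session-b"],
)
```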
16. Using a “Research
bundle”
• Interrogate the published output
• Use the PID graph to find associated
datasets
17. Using a “Research
bundle”
• Next, find one or more experimental
sessions
• Interrogate relatedIdentifiers to
find referenced Dataset nodes
18. Producing a “Research
bundle”
• Finally, expand the experimental session
• Use the graph in a similar way to User Story 8 (Fenner, 2020)
• For each data set, collect its parts.
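The three “research bundle” steps above amount to a walk over the graph starting from the published output. The sketch below runs that walk over a small in-memory dictionary that stands in for the PID Graph; a real tool would query a graph service instead. The record layout, the short identifiers and the function names are all assumptions made for illustration.

```python
from collections import deque


def neighbours(graph, pid):
    """PIDs linked to `pid` in either direction via relatedIdentifiers."""
    record = graph.get(pid, {})
    linked = {r["relatedIdentifier"] for r in record.get("relatedIdentifiers", [])}
    for other, other_record in graph.items():
        links = other_record.get("relatedIdentifiers", [])
        if any(r["relatedIdentifier"] == pid for r in links):
            linked.add(other)
    return linked


def collect_bundle(graph, output_pid):
    """Breadth-first walk from the output to everything connected to it."""
    seen = {output_pid}
    queue = deque([output_pid])
    while queue:
        current = queue.popleft()
        for pid in sorted(neighbours(graph, current)):
            if pid not in seen:
                seen.add(pid)
                queue.append(pid)
    seen.discard(output_pid)
    return sorted(seen)


def link(pid, relation):
    return {"relatedIdentifier": pid, "relationType": relation}


# Toy graph: output <- bundle -> session <- dataset -> part
graph = {
    "paper-1": {"relatedIdentifiers": []},
    "bundle-1": {"relatedIdentifiers": [link("paper-1", "IsSupplementTo"),
                                        link("session-a", "HasPart")]},
    "session-a": {"relatedIdentifiers": []},
    "dataset-1": {"relatedIdentifiers": [link("session-a", "Cites"),
                                         link("part-1", "HasPart")]},
    "part-1": {"relatedIdentifiers": []},
    "unrelated": {"relatedIdentifiers": []},
}

found = collect_bundle(graph, "paper-1")
```

Following links in both directions is a simplification; a production tool would probably restrict the walk to specific relation types so that it does not wander into unrelated work.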
20. What a tool might look like…
• There are three main tasks
• Minting PIDs for the assets
• Linking those PIDs together to produce a bundle
• Retrieving the bundle and presenting them in a structured way
• All this needs to be wrapped into one or more services to be useful
21. Experiment session
manager
• A tool primarily used by infrastructures to
produce the experimental session and
research identifier
• Service provides APIs to mint identifiers
for assets
• Service manages associating assets with
the session record.
22. Experiment session manager
• Researcher makes a booking / drops in to use a machine
• They scan a QR code and are taken to their booking on a website
• They log in using their ORCID and click “Begin Session”
• PID is minted identifying session
• System adds machine PID
• System adds PIDs for the booking, machine configuration, etc.
• During the session other outputs can be added
• Facility staff can add additional information (processed data, samples, etc.)
• ARIA could be extended to do this!
23. The Claiming tool
• When the output has been produced, we
need to link this to all the outputs and
create a “bundle”
• For ARIA users, this would be as simple as
adding the PID of the output to their
proposal
• This is an often-requested feature!
• For others, we need the help of the PID
graph
24. The Claiming tool
• Researcher logs in to a service with their ORCID
• Enter the PID of their output
• A search is performed identifying datasets produced by the author,
allowing selection to be added to a bundle
• Perhaps further optimised by excluding those not “part of” an existing dataset
• They “confirm” the process; a new bundle is created and a PID is
minted.
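The search step in the claiming flow above could look something like the following. It is a sketch over hypothetical in-memory records, not a query against a real service: the record fields, `candidate_datasets` and the `only_parts` flag are assumptions, and the ORCID shown is the well-known fictitious example iD. The optional flag mirrors the optimisation on the slide by dropping datasets that are not “part of” an existing dataset.

```python
def candidate_datasets(records, orcid, only_parts=False):
    """Return ids of datasets created by `orcid`, for the researcher to select from."""
    selected = []
    for record in records:
        if record.get("type") != "Dataset":
            continue
        if orcid not in record.get("creators", []):
            continue
        is_part = any(
            r["relationType"] == "IsPartOf"
            for r in record.get("relatedIdentifiers", [])
        )
        # Optional optimisation: exclude datasets not part of an existing dataset.
        if only_parts and not is_part:
            continue
        selected.append(record["id"])
    return selected


ORCID = "0000-0002-1825-0097"  # fictitious example iD

records = [
    {"id": "ds-1", "type": "Dataset", "creators": [ORCID], "relatedIdentifiers": []},
    {"id": "ds-2", "type": "Dataset", "creators": [ORCID],
     "relatedIdentifiers": [{"relatedIdentifier": "ds-1", "relationType": "IsPartOf"}]},
    {"id": "ds-3", "type": "Dataset", "creators": ["0000-0000-0000-0000"],
     "relatedIdentifiers": []},
    {"id": "sw-1", "type": "Software", "creators": [ORCID], "relatedIdentifiers": []},
]

all_candidates = candidate_datasets(records, ORCID)
part_candidates = candidate_datasets(records, ORCID, only_parts=True)
```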
25. Research bundle
viewer
• Brings this all together in a simple site
• Landing page where a visitor could enter
the PID
• Performs searches and provides an
interface to drill down into the relevant
research bundles.
• … in a usable way
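A landing page that accepts a PID will probably need to tidy up whatever the visitor pastes in before searching. The sketch below assumes the PID is a DOI and normalises common ways of writing one; `normalise_doi` is a hypothetical helper and the validation pattern is a deliberately loose assumption, not an official DOI grammar.

```python
import re

# Loose shape check: "10.", a registrant code, a slash, then a suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(text):
    """Strip resolver prefixes and whitespace, then lowercase the DOI."""
    value = text.strip()
    lowered = value.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    value = value.lower()  # DOIs are case-insensitive
    if not _DOI_PATTERN.match(value):
        raise ValueError(f"not a recognisable DOI: {text!r}")
    return value


doi = normalise_doi("  https://doi.org/10.5281/ZENODO.4275872 ")
```

Other PID types, such as ORCID iDs or handles, would need their own handling.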
27. Next steps…
• Still very much in the concept stage
• But much of this functionality has been
requested by our users
• We have a commitment to better help
link their data…
• Watch this space!