A short overview of the Donders Repository
https://data.donders.ru.nl
Presentation for the Erwin Hahn Institute - 3 February 2021
Robert Oostenveld
r.oostenveld@donders.ru.nl
The Donders Repository
Outline
The aims of the Donders Repository
Procedural design
Technical architecture
Fitting it into the researchers’ daily work
What data goes where?
The timeline of a project and its data
Closing collections, review and versions
Making open data FAIR
BIDS as specific standard for neuroimaging data
Demonstration
Automatic data flow
Informed consent and GDPR
Data Use Agreements
About myself
Member of the Research Infrastructure Committee since 2003.
MEG physicist since 2009.
Affiliated researcher at Karolinska Institutet since 2015.
Associate PI at DCCN since 2020.
Main research interest in the development of data analysis
methods for MEG/EEG, such as source reconstruction,
spectral analysis methods and stats.
Also strong interest in Open Science and Team Science.
Shortly after the “F.C. Donders Centre” started, I initiated the
FieldTrip project (together with colleagues).
In 2010 I got involved in the Human Connectome Project.
In 2014 I got involved with the Donders- and RU-wide RDM efforts, together with
Eric Maris, Erik van der Boogert, and Hurng-Chung Lee (the DR “core team”).
Aims for the Donders Repository
Research initiation -> Data acquisition -> Data analysis & documentation -> Data sharing
Secure the original research data
Document the research process
Make the data accessible to the right people
Your PI, your current collaborators, your future successors
Your audience, i.e. sharing of open data
Keep the data accessible in the future (beyond your contract)
The Donders Repository
Summary of the development timeline
Active contributors: University board, Legal department, Security officer, central IT department,
Research Information Services (library etc.), other interested institutes
2014: initial planning
2015: design of data management protocols and specification of IT requirements
2016: first implementation accessible to researchers
2016-2018: refinements and improvements, increased adoption
2019-now: scale up to the whole Radboud University, improvements to scalability
The Donders Repository is presently used by 1873 researchers to manage their research data,
organized in approximately 1500 collections with about 150 TB of data. There are 144 published
(open access) data sharing collections.
https://data.donders.ru.nl
https://data.ru.nl
The Donders Repository
Procedural design
Different roles:
administration, managers, contributors, viewers
Different collections:
for raw data (Data Acquisition Collection, DAC)
for processed data (Research Documentation Collection, RDC)
for publicly shared data (Data Sharing Collection, DSC)
Collection states:
Open/editable (read-write)
Internal/external review (read-only)
Archived or Published (permanent read-only, DOI)
It should allow for large data (1000s of files, 100s of GB) that are organized
per collection by the researcher. No zip files required, no limits.
Authenticated access from inside and outside the institute. Not a replacement
for the (much faster) work-in-progress storage systems. Suitable as a long-term
(>10 year) archive. It should be scalable and grow along with our needs,
also when IT subsystems change.
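The procedural design above can be sketched in a few lines of Python. This is only an illustration, not the repository's actual implementation. The collection types and the read-write versus read-only states come from the slide. The rule that managers and contributors may write while viewers may only read is an assumption made for this example.

```python
from enum import Enum


class CollectionType(Enum):
    DAC = "Data Acquisition Collection"        # raw data
    RDC = "Research Documentation Collection"  # processed data
    DSC = "Data Sharing Collection"            # publicly shared data


class State(Enum):
    EDITABLE = "editable"                # read-write
    INTERNAL_REVIEW = "internal review"  # read-only
    EXTERNAL_REVIEW = "external review"  # read-only
    ARCHIVED = "archived"                # permanent read-only, DOI
    PUBLISHED = "published"              # permanent read-only, DOI


class Role(Enum):
    MANAGER = "manager"
    CONTRIBUTOR = "contributor"
    VIEWER = "viewer"


# Assumption for illustration: managers and contributors may change files,
# viewers may only read. The slide does not spell out these permissions.
WRITING_ROLES = {Role.MANAGER, Role.CONTRIBUTOR}


def can_write(role: Role, state: State) -> bool:
    """Writing is only possible while the collection is open/editable."""
    return state is State.EDITABLE and role in WRITING_ROLES


def has_doi(state: State) -> bool:
    """Only archived or published collections are permanent and carry a DOI."""
    return state in {State.ARCHIVED, State.PUBLISHED}
```

For example, `can_write(Role.CONTRIBUTOR, State.INTERNAL_REVIEW)` is false, because a collection under review is read-only for everyone.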
The Donders Repository
Technical architecture
iRODS/iCAT: low-level (meta)data management software
Java middleware
Java frontend: web access
Elastic stack (ELK)
WebDAV, future …: file access
Scalable network-attached storage system: Isilon, Compellent, CephFS, future …
Replication storage: iRODS/iCAT, future …
Federated IDP: SURFconext & ORCID (SAML), past …
DataCite DOI
The Donders Repository
Thinking along with the researchers’ struggles with their data
What to do when?
- While preparing the project…
- While acquiring the data…
- When implementing the analysis pipeline…
- When finalizing the manuscript for submission…
Making it attractive for our (junior) researchers.
This is part of the embedding that researchers get
at the DCCN, which also includes support for ethics,
experimental design, acquisition, analysis,
high-performance computing, etc.
The typical PhD student or Postdoc goes through
this cycle a few times while at the DCCN.
Scanners and labs: DICOM, physiology, eyetracking, MEG, presentation log files, questionnaires, … (“raw”)
Central storage /project/30xx0yy.zz: analysis scripts and intermediate results
convert, deface, etc: de-identified data and some results (“shared”)
Donders Repository: DAC, DSC, RDC
The Donders Repository
The timeline of a project and its data
The researcher presents a project proposal and gets a PPM number
The administrator creates a data acquisition collection, the PI and
the researcher are usually both managers
The researcher collects and archives raw data
The researcher analyzes the data
The researcher writes and submits a manuscript
The administrator creates a research documentation and a
data sharing collection
The researcher moves the processed data to the archive
The researcher moves the to-be-published data to the archive
The three collections are closed
Reviewing and archiving data using the Donders Repository
Collection states for internal and for raw data
Editable/open -> Internal review -> Archived
This creates a new version with the same DOI, the old version remains available as well
Reviewing and publishing data using the Donders Repository
Collection states for shared/published data
Editable/open -> Internal review -> External review -> Published
This creates a new version with the same DOI, the old version remains available as well
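The two review workflows can be read as a small state machine. The sketch below is a hedged illustration in Python, not the repository's code. The state names follow the slides. The `Collection` class, the `advance` and `reopen` methods, and the idea that reopening a closed collection is what triggers the new version are assumptions made for this example. A review that sends the collection back to the editable state is not modelled.

```python
from dataclasses import dataclass, field

# Forward order of states per workflow, following the two state diagrams.
WORKFLOWS = {
    "archive": ["editable", "internal review", "archived"],
    "publish": ["editable", "internal review", "external review", "published"],
}


@dataclass
class Collection:
    doi: str                 # stays the same across versions
    workflow: str            # "archive" (internal/raw data) or "publish" (shared data)
    state: str = "editable"
    version: int = 1
    closed_versions: list = field(default_factory=list)

    def advance(self) -> str:
        """Move the collection to the next state of its workflow."""
        steps = WORKFLOWS[self.workflow]
        position = steps.index(self.state)
        if position == len(steps) - 1:
            raise ValueError(f"collection is already {self.state}")
        self.state = steps[position + 1]
        if self.state == steps[-1]:
            # Closing the collection freezes the current version.
            self.closed_versions.append(self.version)
        return self.state

    def reopen(self) -> int:
        """Assumed trigger for a new version: reopening a closed collection."""
        if self.state != WORKFLOWS[self.workflow][-1]:
            raise ValueError("only a closed collection can be reopened")
        self.version += 1
        self.state = "editable"
        return self.version
```

After a full cycle followed by `reopen()`, the collection is at version 2 with an unchanged `doi`, and version 1 is still listed in `closed_versions`.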
The Donders Repository
Making Open Data FAIR
Findable
Make your data available in a repository with a persistent identifier (DOI,
handle) and metadata
Accessible
Be explicit about data usage terms (agreement with downloader)
Interoperable
Make your data human and machine readable, e.g. BIDS
Reusable
Make sure you document enough details, e.g. as a “data descriptor” paper
which can be cited, along with citing your data -> measurable impact!
The Donders Repository
Making Open Data FAIR
The Donders Repository takes care of
- Storage
- Procedures and protocols
- Roles and responsibilities
- Long-term management
- Authentication and authorization
- Internal data flow/access
- Data use agreements
- External data access
- Pushing metadata to RIS, NARCIS and Google
The Donders Repository
Making Open Data FAIR
The Donders Institute is very broad, with 4 centres
over 3 faculties, 80 principal investigators and
some 800 researchers.
We don’t impose explicit standards for how to organize
and store generic neuroscience data or metadata.
Only minimal metadata at the collection level.
Multiple domain-specific standards are needed for the I (Interoperable) and R (Reusable) of FAIR.
BIDS is a way to organize your existing raw data
To improve consistent and complete documentation
To facilitate re-use by your future self and others
BIDS is not
A new file format
A search engine
A data sharing tool
Making human neuroimaging data FAIR
https://bids-standard.org
https://github.com/Donders-Institute/bidscoin
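To make "a way to organize your existing raw data" concrete, the sketch below builds BIDS-style file paths in Python. It is a minimal illustration that covers only the `sub`, `ses` and `task` entities and does no validation. The function name `bids_path` and its parameters are hypothetical and are not part of BIDS or of bidscoin.

```python
from pathlib import PurePosixPath
from typing import Optional


def bids_path(subject: str, datatype: str, suffix: str, extension: str,
              session: Optional[str] = None,
              task: Optional[str] = None) -> PurePosixPath:
    """Compose sub-<label>/[ses-<label>/]<datatype>/<entities>_<suffix><extension>."""
    entities = [f"sub-{subject}"]
    folders = [f"sub-{subject}"]
    if session is not None:
        # The session appears both as a folder level and in the file name.
        entities.append(f"ses-{session}")
        folders.append(f"ses-{session}")
    if task is not None:
        # The task only appears in the file name.
        entities.append(f"task-{task}")
    filename = "_".join(entities + [suffix]) + extension
    return PurePosixPath(*folders, datatype, filename)


anatomical = bids_path("01", "anat", "T1w", ".nii.gz")
functional = bids_path("01", "func", "bold", ".nii.gz", session="mri01", task="rest")
```

Here `anatomical` is `sub-01/anat/sub-01_T1w.nii.gz` and `functional` is `sub-01/ses-mri01/func/sub-01_ses-mri01_task-rest_bold.nii.gz`. The data itself stays in existing file formats such as NIfTI; only the naming and folder layout are standardized.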
The Donders Repository
Summary
The aims of the Donders Repository
Procedural design
Technical architecture
Fitting it into the researchers’ daily work
What data goes where?
The timeline of a project and its data
Closing collections, review and versions
Making open data FAIR
BIDS as specific standard for neuroimaging data
Demonstration
Automatic data flow
Informed consent and GDPR
Data Use Agreements
www.ru.nl/donders
https://data.donders.ru.nl
r.oostenveld@donders.ru.nl
Demo datasets:
https://doi.org/10.34973/3jk5-6j57 : mouse data, CC-BY
https://doi.org/10.34973/j05g-fr58 : “Dr Who” MRI, RU-DI-HD
The Donders Repository
Data Use Agreement - Data Sharing Collections
Data use agreement for identifiable human data
Version RU-DI-HD-1.0
I request access to the data collected in the digital repository of the Donders Institute for Brain, Cognition and Behaviour, part of the Radboud University,
established at Nijmegen, the Netherlands (hereinafter referred to as the Donders Institute), and I agree to the following:
1. I will comply with all relevant rules and regulations imposed by my institution and my government. This may mean that I need my research to be approved or
declared exempt by a committee that oversees research on human subjects, e.g. my Institutional Review Board or Ethics Committee.
2. I will not attempt to establish the identity of or attempt to contact any of the included human subjects. I will not link this data to any other database in a way
that could provide identifying information. I understand that under no circumstances will the code that would link these data to an individual’s personal
information be given to me, nor will any additional information about individual subjects be released to me under these Data Use Terms.
3. I will not redistribute or share the data with others, including individuals in my research group, unless they have independently applied and been granted
access to this data.
4. I will acknowledge the use of the data and data derived from the data when publicly presenting any results or algorithms that benefitted from their use.
(a) Papers, book chapters, books, posters, oral presentations, and all other presentations of results derived from the data should acknowledge the origin of the
data as follows: "Data were provided (in part) by the Donders Institute for Brain, Cognition and Behaviour".
(b) Authors of publications or presentations using the data should cite relevant publications describing the methods developed and used by the Donders
Institute to acquire and process the data. The specific publications that are appropriate to cite in any given study will depend on what the data were used and
for what purposes. When applicable, a list of publications will be included in the collection.
(c) Neither the Donders Institute or Radboud University, nor the researchers that provide this data should be included as an author of publications or
presentations if this authorship would be based solely on the use of this data.
5. Failure to abide by these guidelines will result in termination of my privileges to access to these data.
Typical (inefficient) reuse of raw data
[Figure: timeline labelled Year 0, Year N, Never. Acquisition (PPM, DAC) is followed by three separate rounds of analysis -> publication, each with its own DSC and RDC.]
More efficient reuse of shared/published data
[Figure: acquisition (PPM, DAC) -> common preproc. (DSC), followed by three rounds of specific analysis -> publication, each with its own RDC.]

Editor's Notes

  • #9 Relevant to mention: many data management systems have a web interface, but you cannot reliably download or upload many files and nested directories in the browser, only zip files. And that imposes constraints on the size and organization.
  • #10 Web access is mainly for metadata and management. File access is for moving data in and out.