Open Science, Open Data and BIDS for EEG
Robert Oostenveld
Donders Institute, Radboud University, Nijmegen, NL
Karolinska Institutet, Stockholm, SE
r.oostenveld@donders.ru.nl
These slides will
be shared online
on slideshare
Outline of this session
Dorothy Bishop – simulate and pre-register for more reproducible EEG
Aina Puce – better and detailed reporting of results
Robert Oostenveld – sharing of data and analysis details
What is Open Science?
Open educational resources
Open access publications
Pre-registration
Open peer review
Open methodology
Open source
Open hardware
Open data
What is Open Science?
Library of Charcot - ICM
Science – methods and tools are changing
Open Science – infrastructure and tools
Git and GitHub, Gitlab, BitBucket
Work together on code for analyses
Open Science Framework (osf.io)
Work together on documenting
DataVerse, Zenodo, etc
Sharing of data
Code Ocean, Microsoft Azure, Anaconda Cloud
Cloud-based computational reproducibility platform
Past - Black-and-white version of article printed on dead trees
Present - PDF for download, sometimes online supplementary material
Future - Online notebooks that reproduce the results in detail
Lab notebook
Science is getting more exciting – but also harder in some ways
Open Science – planning ahead
Planning your analysis
Planning and publishing primary outcomes
Writing your scientific papers
Writing your PhD thesis
Public outreach
Planning and publishing secondary outcomes
Publishing details on the methods
Data management plan
Publishing your data
Sharing primary and secondary outcomes
Publication with the primary findings
To the wider audience
To your scientific peers
Methods
Protocol
Stimulus material
Analysis methods
Original data
Details on the results
https://en.wikipedia.org/wiki/IMRaD
Open Science – planning ahead
Planning your analysis
Planning and publishing primary outcomes
Writing your scientific papers
Writing your PhD thesis
Public outreach
Planning and publishing secondary outcomes
Publishing details on the methods
Data management plan
Publishing your data
Share/publish your methods
More details in your analysis than fit in your “Methods” section
Not possible to describe details in human-oriented text
Batch scripting
MATLAB, Python, R, SPSS, Bash, …
Analysis script corresponds to computer code
Version management tools for source code
Git, Subversion, Mercurial
GitHub, Gitlab, BitBucket
Version control - linear
V1
V2
V3
V4
2018-02-24
2018-03-16
2018-05-30
2018-06-05
Version control – branching …
V1
V2
V3-Yours V3-CoAuth1 V3-CoAuth2
V4-Merged
Version control – branching and merging
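The branching and merging picture can be sketched as a small graph in which every version records the version(s) it was derived from. This is only an illustration in Python, not how Git works internally. The `parents` mapping and the `ancestors` helper are hypothetical names, and the sketch assumes that all three V3 branches start from V2.

```python
# Each version lists the version(s) it was derived from.
# A merge is simply a version with more than one parent.
parents = {
    "V1": [],
    "V2": ["V1"],
    "V3-Yours": ["V2"],
    "V3-CoAuth1": ["V2"],
    "V3-CoAuth2": ["V2"],
    "V4-Merged": ["V3-Yours", "V3-CoAuth1", "V3-CoAuth2"],
}


def ancestors(version, graph):
    """Return the set of all versions that `version` builds upon."""
    seen = set()
    stack = list(graph[version])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


# The merged version builds upon the work of all three branches.
print(sorted(ancestors("V4-Merged", parents)))
```

A branch only knows about its own line of work: `ancestors("V3-Yours", parents)` does not contain the co-authors' versions until everything is merged.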
Version control – branching multiple analyses
V1
V2
V3-b V3-a V3-failed
V4-b V4-a
V5
Version control – collaborating
[Figure: the same version history (V1, V2, V3-b, V3-a, V3-failed, V4-b, V4-a, V5) is shown three times: your copy on your computer, your copy on GitHub, and someone else’s copy on GitHub. The versions V5-a and V6-a appear in all three copies, followed by V7.]
Open Science – version control
Multiple versions/editions of your analysis scripts
Release when you think it is ready, i.e. upon publication
New revision when it has been improved
Versions of software … at a time scale of years/months/weeks
Editions of books … at a time scale of decades
Original scientific data stays constant, but its interpretation
may change over time.
Data Management Plan
Think about the data that you will collect
and how to document it
… since you want others to re-use your data
Document the details of your data, e.g. in a “codebook”
… since you want to use data collected by others
To learn new analysis skills
As pilot
For (re)analysis
… since you want to re-use your own data
Write documentation for your “future self”
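A codebook can be as simple as a table that describes every variable in the data. The sketch below writes one as tab-separated text. The variable names, units and descriptions are invented for illustration, and `write_codebook` is a hypothetical helper rather than part of any existing tool.

```python
import csv
import io

# Hypothetical variables; replace with the columns of your own dataset.
codebook = [
    {"variable": "participant_id", "type": "string", "unit": "n/a",
     "description": "Pseudonymized participant code"},
    {"variable": "age", "type": "integer", "unit": "years",
     "description": "Age at the time of acquisition"},
    {"variable": "reaction_time", "type": "float", "unit": "s",
     "description": "Time between stimulus onset and button press"},
]


def write_codebook(entries, stream):
    """Write the codebook entries as tab-separated text to `stream`."""
    fieldnames = ["variable", "type", "unit", "description"]
    writer = csv.DictWriter(stream, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)


# Write to an in-memory buffer; an opened file works in the same way.
buffer = io.StringIO()
write_codebook(codebook, buffer)
print(buffer.getvalue())
```

Keeping the codebook in a plain-text table next to the data means that it can be read without special software and tracked with the same version control tools as the analysis scripts.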
Open Data
Findable
Data and supplementary materials have sufficiently rich metadata
and a unique and persistent identifier.
Accessible
Data is deposited in a trusted repository.
Authentication and authorization procedure where necessary.
Interoperable
(Meta)data uses a formal, shared, and broadly applicable language or format.
Reusable
Data is described with clear and understandable attributes.
There should be a clear and acceptable license for re-use.
https://www.force11.org/group/fairgroup/fairprinciples
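One way to make these principles concrete is a checklist that is run on the metadata record of a dataset before it is deposited. A minimal sketch, assuming the record is stored as a Python dictionary. The field names (`identifier`, `description`, `repository`, `format`, `license`) and the example values are hypothetical and are not prescribed by the FAIR principles themselves.

```python
# Hypothetical mapping from each FAIR principle to the metadata
# fields that this checklist expects to be filled in.
FAIR_FIELDS = {
    "Findable": ["identifier", "description"],
    "Accessible": ["repository"],
    "Interoperable": ["format"],
    "Reusable": ["license"],
}


def missing_fair_fields(record):
    """Return, per principle, the expected fields that are empty or absent."""
    missing = {}
    for principle, fields in FAIR_FIELDS.items():
        absent = [name for name in fields if not record.get(name)]
        if absent:
            missing[principle] = absent
    return missing


# Made-up example record in which the license has not been chosen yet.
record = {
    "identifier": "doi:10.0000/example",
    "description": "EEG recordings during an auditory task",
    "repository": "institutional repository",
    "format": "BIDS",
    "license": "",
}

print(missing_fair_fields(record))
```

Such a check only tests whether a field is filled in. Whether the metadata is sufficiently rich or the license acceptable still requires human judgement.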
Open Science - data
Shared data allows for
Improved reproducibility
Pooling, to detect small effects that require large group sizes
Data mining, discovery science and generating new hypotheses
Results in methodological opportunities
Improve algorithms
Estimate effect and group size
Make informed decisions on analysis pipeline
Prevent HARKing and p-hacking
Data from human participants
General Data Protection Regulation (GDPR)
Challenges:
Explicit and strict protection of personal data
Opportunities:
Less influence of national legislation differences
Learn from each other
Develop best practices
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:OJ.L_.2016.119.01.0001.01.ENG
Open data versus privacy
Personal data
name
address
date of birth
phone number
license plate number
IP address
...
Crime Scene Investigation
http://www.abc.net.au/news/2017-09-19/csi/8960590
(Biometric) data
facial details
dental record
fingerprint
genetics
cortical folding pattern
clinical data
cortical response to stimulation
responses to a questionnaire
Personal Data is needed
and should be managed
Required for administration
Contacting your participants
Paying your participants
Follow up incidental findings
Often not required to address the research question
Sometimes used as confound
Check whether the sample is representative
Possibly required to assess scientific integrity
Personal Data
Personal data
Name, address, date of birth
Special personal data = “bijzondere persoonsgegevens” in NL
Race
Religion or beliefs
Health
Sexual activities
Political preference, membership of a union
Criminal record
Indirect personal data – identifies someone … when linked to another database
Fingerprint, DNA, facial details
Anatomical MRI
Specific pattern of data (e.g. answers on a questionnaire or interview)
https://autoriteitpersoonsgegevens.nl/nl/over-privacy/persoonsgegevens/wat-zijn-persoonsgegevens
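Why indirect personal data is a risk can be shown with a toy example: a research table without names is linked to a second table that shares a few attributes. All records below are made up, and `link_records` is a hypothetical helper.

```python
# Made-up research data: no names, only a code and some attributes.
research = [
    {"code": "sub-01", "birth_year": 1985, "postcode": "6525", "score": 12},
    {"code": "sub-02", "birth_year": 1990, "postcode": "6525", "score": 7},
]

# Made-up external database that does contain names.
external = [
    {"name": "A. Jansen", "birth_year": 1985, "postcode": "6525"},
    {"name": "B. de Vries", "birth_year": 1972, "postcode": "6511"},
]


def link_records(left, right, keys):
    """Pair records from both tables that agree on all `keys`."""
    matches = []
    for a in left:
        for b in right:
            if all(a[k] == b[k] for k in keys):
                matches.append((a["code"], b["name"]))
    return matches


# sub-01 is linked to a name, although the research data holds no name.
print(link_records(research, external, ["birth_year", "postcode"]))
```

The more attributes the two tables share, the more specific the match becomes. That is the reason for removing or blurring attributes that are not needed for the research question.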
Gradient between personal and research data
personal data – easy: keep private and don’t share
indirect personal data – hard: ?
a lot of research data – easy: share as it is with others
Limit possible identification
Anonymous
Nobody is able to identify the participant
Pseudonymization
Use a code instead of the participant’s name
De-identification
Remove (indirectly) identifying features
Blur the indirect personal data
Deface anatomical MRI
Age at the time of acquisition instead of date of birth
Use age bins instead of years
Questionnaire outcomes rather than individual item scores
…
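Two of these steps, pseudonymization and replacing the date of birth by the age at acquisition, can be sketched in a few lines of Python. The helper names `pseudonymize` and `age_at` are hypothetical, the participants are made up, and the sketch assumes that every name occurs only once.

```python
from datetime import date


def pseudonymize(names):
    """Assign codes such as sub-01, sub-02, ... in order of appearance.

    The returned key file links names to codes and must be kept private.
    """
    return {name: f"sub-{index:02d}" for index, name in enumerate(names, start=1)}


def age_at(acquisition, birth):
    """Age in whole years on the day of acquisition."""
    had_birthday = (acquisition.month, acquisition.day) >= (birth.month, birth.day)
    return acquisition.year - birth.year - (0 if had_birthday else 1)


# Made-up participants.
key_file = pseudonymize(["Participant One", "Participant Two"])
print(key_file["Participant Two"])
print(age_at(date(2018, 6, 5), date(1990, 9, 1)))
```

Pseudonymized data is not anonymous: as long as the key file exists, the code can be traced back to the participant. Only the code and the age would go into the shared dataset.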
Appropriate blurring
depends on the situation
… for example the age of the participant
1-month bins versus 10-year bins
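Blurring the age into bins takes a single function in which the bin width is a parameter. A minimal sketch, with the hypothetical helper `bin_value`. The value and the width should be in the same unit, for example months or years, and which width is appropriate depends on the study population.

```python
def bin_value(value, width):
    """Return the bin that contains the value, as a (lower, upper) pair.

    The lower bound is inclusive and the upper bound is exclusive.
    """
    lower = (value // width) * width
    return (lower, lower + width)


# Age in years, blurred into 10-year bins.
print(bin_value(27, 10))

# Age in months, kept in 1-month bins.
print(bin_value(7, 1))
```

Wider bins make a participant harder to identify but also remove information that an analysis may need, so the choice is a trade-off.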
Personal and research data
indirect personal
data
personal data
a lot of research data
Personal and research data
data minimization
pseudonymization
data minimization
de-identifying, blurring
a lot of research data
personal data
indirect personal data
Share responsibly with legal constraints on reuse
Keep safe and private
Legal constraints
Contract between the researcher
… and the funding agency
… and the ethics committee
… and the participants/patients
… and the publisher of the results
… and the recipient of the data upon sharing
Legal constraints – Data Use Agreement
CC0 - Public Domain
No copyright.
The person who associated a work with this deed
has dedicated the work to the public domain by
waiving all of his or her rights to the work
worldwide under copyright law, including all related
and neighboring rights, to the extent allowed by law.
You can copy, modify, distribute and perform the
work, even for commercial purposes, all without
asking permission.
Donders Institute - Data Use Agreement
for identifiable human data
I will comply with all relevant rules and regulations
imposed by my institution and my government ….
I will not attempt to establish the identity of or attempt
to contact any of the included human subjects. I will not
link this data to any other database in a way that could
provide identifying information ….
I will not redistribute or share the data with others,
including individuals in my research group, unless they
have independently applied and been granted access to
this data.
I will acknowledge the use of the data and data derived
from the data when publicly presenting …
Failure to abide by these guidelines will result in
termination of my privileges to access to these data.
https://creativecommons.org/publicdomain/zero/1.0/
https://data.donders.ru.nl/doc/dua/
participant → you → recipient
Brain Imaging Data Structure
http://bids.neuroimaging.io
What is it?
BIDS is a way to organize your existing raw data
To improve consistent and complete documentation
To facilitate re-use by your future self and others
BIDS is not
A new file format
A search engine
A data sharing tool
BIDS for MRI, MEG, EEG … in future also iEEG, PET, eye-tracker, etc.
data/README
data/CHANGES
data/dataset_description.json
data/participants.tsv
data/sub-01/anat/…
data/sub-01/meg/…
data/sub-01/eeg/sub-01_task-auditory_eeg.edf
data/sub-01/eeg/sub-01_task-auditory_eeg.json
data/sub-01/eeg/sub-01_task-auditory_channels.tsv
data/sub-01/eeg/sub-01_task-auditory_events.tsv
data/sub-01/eeg/sub-01_electrodes.tsv
data/sub-01/eeg/sub-01_coordinates.json
EDF
BrainVision
Neuroscan
Biosemi
EEGLAB .set
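Because the file names follow a fixed pattern, they can be generated instead of typed by hand. The sketch below builds the EEG file names for one subject and task, following the example on this slide. `bids_eeg_files` is a hypothetical helper, and the exact suffixes should be checked against the current BIDS specification before use.

```python
from pathlib import PurePosixPath


def bids_eeg_files(root, subject, task, extension=".edf"):
    """List the EEG-related file paths for one subject and task,
    following the naming pattern shown on the slide."""
    folder = PurePosixPath(root) / f"sub-{subject}" / "eeg"
    prefix = f"sub-{subject}_task-{task}"
    return [
        folder / f"{prefix}_eeg{extension}",
        folder / f"{prefix}_eeg.json",
        folder / f"{prefix}_channels.tsv",
        folder / f"{prefix}_events.tsv",
        folder / f"sub-{subject}_electrodes.tsv",
        folder / f"sub-{subject}_coordinates.json",
    ]


for path in bids_eeg_files("data", "01", "auditory"):
    print(path)
```

Existing tools such as FieldTrip and MNE-Python offer their own functions for organizing data according to BIDS. The sketch is only meant to show how regular the naming scheme is.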
Metadata in “sidecar” files
Participants
Demographics
Questionnaire outcomes
Equipment
Amplifier, cap, electrode type and placement
Filter settings, reference
Design, task and conditions
Instructions, stimulus material, responses
Trigger codes
Also some details from EEG data to make querying easier
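A sidecar file is plain JSON or tab-separated text placed next to the data file. The sketch below prepares an `_eeg.json` sidecar and an `_events.tsv` table in Python. The keys follow the BIDS conventions for EEG as far as I am aware, all values are invented, and `events_to_tsv` is a hypothetical helper.

```python
import json

# Hypothetical recording details; the values are invented.
sidecar = {
    "TaskName": "auditory",
    "SamplingFrequency": 1000,
    "PowerLineFrequency": 50,
    "EEGReference": "Cz",
    "SoftwareFilters": "n/a",
    "Manufacturer": "n/a",
    "Instructions": "Listen to the tones and press the button on deviants.",
}

# Hypothetical events: onset and duration in seconds, plus the trigger code.
events = [
    {"onset": 1.25, "duration": 0.1, "trial_type": "standard", "value": 1},
    {"onset": 2.50, "duration": 0.1, "trial_type": "deviant", "value": 2},
]


def events_to_tsv(rows):
    """Format the events as tab-separated text with a header line."""
    columns = ["onset", "duration", "trial_type", "value"]
    lines = ["\t".join(columns)]
    for row in rows:
        lines.append("\t".join(str(row[c]) for c in columns))
    return "\n".join(lines) + "\n"


print(json.dumps(sidecar, indent=2))
print(events_to_tsv(events))
```

Because the sidecar files are plain text, they can be read and searched without opening the EEG recording itself.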
Why use BIDS?
Developed with open community discussion
and involvement of experienced researchers
Neuroinformatics and analysis tools available for it
EEGLAB, FieldTrip, MNE-Python, Brainstorm
Increases the chance of your data being indexed and reused
(Future) applications for searching, automated analyses, …
But … it is more important that you share and what you share
than how you share it
Summary
New tools to be adopted for Open Science
Planning ahead for analysis and data
Version control and release of analysis details
Data management plan
Responsible sharing, considering your participants’ rights
Organizing EEG data according to BIDS
Suggested further reading
This presentation on
https://www.slideshare.net/robertoostenveld
https://opensciencemooc.eu
https://open-science-training-handbook.gitbooks.io/book
http://software-carpentry.org
http://bids.neuroimaging.io
http://data.donders.ru.nl

CuttingEEG - Open Science, Open Data and BIDS for EEG


Editor's Notes

  • #2 Questions at the end. A participant's request to erase their data -> informed consent procedure. Description of metadata -> linking to external ontologies.
  • #3 Dorothy: simulations, dummy conditions, replicate yourself, pre-registration
  • #4 Review and critical evaluation beyond publication – open methods, tools and data
  • #5 Open Science touches upon each aspect of the research cycle, you start using it when thinking and planning, all the way through maximizing your impact
  • #6 These have already been discussed in detail by Adrienn in the previous lecture
  • #11 OSF is the service by the Center for Open Science
  • #13 Introduction, Methods, Results and Discussion. The closer your scientific peers are, the more interest they will have in the middle section.
  • #17 This is something you know, not only from code but also from manuscripts that you write
  • #29 Personal data is what the CSI will search for, and if they cannot find it they will look at biometric data.