Data Governance in Two Different Data Archives: When is a Federal Data Repository Useful?
Director, Office of Technology Development and Coordination
National Institute of Mental Health
National Institutes of Health
1) Most research subjects want their data to be used to understand
disease broadly. They are not too concerned about how researchers
use their data.
2) The diseases we are trying to understand today are complex, meaning
that the same symptoms can have many different underlying biological
causes. Except in the cases where a deeply penetrant point mutation
uncovers a single biological pathway to a disease, understanding the
“subgroups” for complex diseases requires data from large populations
who have similar symptoms.
3) Differences in data sharing laws in different countries make it difficult
or impossible to move data across international borders. Federating
data archives that are storing data in a similar way provides an
inelegant but workable solution to this problem.
4) Despite the urgent need to aggregate data to understand complex
diseases, individual consents and local laws must be respected.
Policy Considerations can be Manipulated to Become
an Excuse Not to Share Data
• Contrast two data archives that have built the infrastructure necessary
to aggregate data on complex diseases.
• NIMH Data Archive (NDA) Overview
▪ Federal Data Repository where the data are owned by the US National
Institutes of Health
▪ Policy Issues
• Human Connectome Program (HCP)
▪ Large NIH funded project
▪ Access to most data was by self-certification
▪ Initial Data Distribution was through Washington University
• Stores data, deposited by research laboratories, from experiments
involving human subjects.
▪ Federal data repository
▪ Originally contained data from human subjects related to mental illness
(and control subjects), but that has expanded in a number of ways over the
past 12 months. Most subjects have consented to broad data sharing.
▪ Data are available to the research community through a not-too-difficult access process.
▪ Both submission and access to subject level data require approval of an institutional official.
▪ Summary data are available to everyone with a browser (https://data-…).
• Begun in late 2006, and the first data were received in 2008
• The data types include demographic data, clinical assessments,
imaging data, and –omic data. There are no formal limits to the types
of data that can be stored in NDA.
NIMH Data Archive
• The NDA currently makes data available to the research community
from 200,000 subjects. Additional data are held by the NDA but are
not yet ready for sharing because the grant is still active and/or
papers have not yet been published.
• Many subjects have longitudinal data.
• ~1.1 PB of imaging and –omic data are securely stored in the Amazon cloud.
• Currently, the NDA does not contain any personally identifiable
information, but we expect to begin holding such data in the near
future (data from mobile devices).
▪ This change will likely require that NDA verify that the use of the data has
been approved by an Institutional Review Board.
NIMH Data Archive – Current Size and Scope
• It is best to think of NDA as a large (~182,000 data elements by
~200,000 people), sparse, two-dimensional matrix (a small sketch follows).
NDA Structure – Rows and Columns are the Building Blocks
• The NDA data dictionary is one of the key building blocks for this
repository. It provides a flexible and extensible framework for data
definition by the research community.
• 2,000+ instruments, freely available to anyone
▪ 180,000+ unique data elements and growing
▪ Data dictionaries describing
• MRI Modalities
• Other complex data (EEG, eye tracking)
• Accommodates any data type and data structure
• Describes the data collected by the research community
Data Dictionary – The First Building Block
• Curated by NDA (this takes a lot of time)
• Data held in different archives need to use common data
dictionaries to allow deep federation.
• The associated validation tool allows investigators to
quickly perform quality control tests of their data without
submitting data anywhere.
• Data in archives that don’t have a similar QC step are
likely to have issues.
• Both to enhance the quality of the science and to ensure
that the time and effort research subjects spend in our
research protocols are not wasted, the validation tool
should be run frequently (daily or weekly). This is common
practice in many other domains.
Data Dictionary – The First Building Block
• The NDA GUID software allows any researcher
to generate a unique identifier using some
information from a birth certificate.
• If the same information is entered in different
laboratories, the same GUID will be generated.
• This strategy allows NDA to aggregate data on
the same subject collected in multiple
laboratories without holding any of the personally
identifiable information about that subject.
• The GUID is now being discussed in a number of
additional research communities. We think we
have a reasonable plan to prevent a GUID from
becoming something like a social security
number (which would be identifying in itself).
• External studies indicate that the GUID
implementation is pretty robust both to false
positives and false negatives in large populations.
Global Unique Identifier – the Other Building Block
At this point, data have been received from the laboratory
that measured them. Each subject has a GUID or a
pseudo-GUID. A data dictionary has been defined,
and the submitted data have been validated against that
dictionary. How does an outside user find data they are interested in?
An Example of Data Associated with a Particular Laboratory
We are now assigning DOIs to each study, and we can track how often a
DOI link is clicked (the start of a data citation).
• Assertion: Any consent language that restricts the use of the data
for particular purposes (for autism research…) results in profound
limitations on how the data can be used.
• For example, if a researcher is trying to aggregate data between
subjects with schizophrenia and autism to understand common
symptoms that are observed in the two diagnostic groups, a consent that
limited a dataset for use only to understand one of those diagnostic
conditions would probably mean the data are not accessible for that analysis.
• Restricted data are also probably off limits for those who are trying to
use data mining techniques to develop or substantiate a hypothesis.
• There are some cases where restrictive consents might be appropriate,
but this should be the rare exception.
Policy – Consents
• NIMH expects that research we pay for involving human subjects
will result in that data being made available in a repository.
• Journals can also have a positive role to play in requiring that
data be placed in a repository prior to publication.
• Asking for volunteers to deposit data probably isn’t good enough
right now.
Policy – Data Deposition
• Summary data are available to anyone via the web site, but
accessing subject level data requires a data access form.
• Similarly, a data submission agreement is required that
certifies that the data were consented for sharing.
• Both forms require signatures from the PI and an
institutional official. This means that the research
institution is formally responsible for ensuring that the
data are “treated with respect”.
• Although neither form is complicated, they do raise barriers
to accessing the data.
Policy – NDA Data Access and Data Submission
• The NIMH Data Archive does hold some data that were
collected outside the US.
• For those datasets, the institutional official has decided that
depositing data is allowed both by the terms of the informed
consent and by the laws in that country.
• When there are restrictions on moving data, it
is still possible to make it easy for the research community
to find data by federating data archives.
Policy – Data from Institutions Outside the US
• For NDA, submitting data is separate from sharing that data
with the research community.
• Data are shared when the grant is complete or when a
paper is published.
• Other sharing timelines are possible.
• No matter when the data are shared, data need to be
submitted on a regular basis. This ensures that the data
from a grant award have been submitted before funding is
exhausted. More importantly, periodic data submission
ensures that the data have undergone basic QC checks as
they are collected.
Policy – Timeline for Data Sharing
• Responding to a number of instances of high visibility/impact
experiments that were not thoughtfully designed, NIH (and
NIMH) have instituted a number of programs to enhance
rigor and reproducibility in research supported by NIH.
• These discussions with the community started in June 2012.
The new guidelines to increase rigor and reproducibility are
outlined in NOT-OD-15-103 and at a web site.
• Data archives play an important role in improving the rigor
and reproducibility of NIMH funded research.
Rigor and Reproducibility – Data Archives Help
• Data dictionaries are a key part of the NDA infrastructure. Each item in a data
dictionary has an allowable range of values. The NDA has a validation tool that
allows users to check a data set to see if it conforms with the allowable ranges and
formats in a data dictionary.
• Because of our mandated data deposition schedule, the validation tool allows labs
to find errors every 6 months when data are deposited (or more often if they choose).
Rigor and Reproducibility - 2
• The NDA makes it easy to identify the data associated with a publication, and we
assign a DOI to that dataset to make it trivial for the research community to find the data.
• Identifying the data from a publication allows researchers to look at all of the data
collected under an award and compare that to the data used in the publication.
Rigor and Reproducibility - 3
• There are “professional” research participants who seem to make a
living volunteering for clinical studies. Websites exist that make it
relatively easy for such participants to find out the right answers to
screening questions to be admitted to a study. Clearly, this can be
dangerous to the volunteer and can also put the rigor and reproducibility
of the study at risk.
• Recruitment in certain diagnostic categories may take place in a small
number of clinical centers. This means that papers from many different
research groups may be sampling from a smaller population than the
“independent” papers might suggest.
• The NDA GUID helps the research community understand the size of
these problems and deal with these issues.
• There are also commercial services that aid in screening for
participants who are enrolled in multiple clinical trials.
GUIDs and Rigor/Reproducibility
1) NIMH Data Archive – very heterogeneous data collected in multiple
laboratories. NDA attempts to aggregate this data using a global
unique identifier system as well as data dictionaries to describe the data.
2) Human Connectome Program – heterogeneous data (clinical
assessments, imaging, MEG, genomics) collected using a common
protocol. The first phase of this project involved data collection from
typical research subjects at a single site. The project has recently
expanded to include data collected across the lifespan for control
subjects as well as from subjects with a diagnosis. Those datasets are
collected at multiple laboratories, but still use similar data collection protocols.
Two Different Sorts of Data Archives
• The NIH Human Connectome Project (HCP) is supported by the NIH
Neuroscience Blueprint ICs
• The HCP is an ambitious effort to map the neural pathways that underlie
human brain function. The overarching purpose of the Project is to
acquire and share data about the structural and functional connectivity of
the human brain. It has greatly advanced the capabilities for imaging and
analyzing brain connections, resulting in improved sensitivity, resolution,
and utility, thereby accelerating progress in the emerging field of human connectomics.
• Phase 1 of the HCP resulted in two awards
■ David Van Essen and Kamil Ugurbil, Wash U and U Minnesota
■ Bruce Rosen, MGH
Human Connectome Project
1) Deliver advanced MRI scanners and
techniques with high spatial and
temporal resolution for functional
and diffusion MRI.
■ Both the MGH and the Wash U MRIs
worked as designed and were able
to collect data quickly. Siemens
learned a great deal from
collaborating on both instruments,
and their new family of 3T MRIs (the
Prisma) has operating characteristics
similar to the Wash U scanner.
■ The supplements to port the pulse
sequences to other laboratories and
to other manufacturers have also been successful.
Phase 1 Connectome Accomplishments
2) Deliver high quality data to the research community
■ Wash U has released data from 1,200 subjects. This includes
behavioral assessments, structural MRIs, rs-fMRI, task fMRI, and
diffusion experiments. MEG data have also just been released. This is
the first time that a large imaging award adopted “genome speed” data release.
■ Data from MGH are being made available on their web site as well as
at the Wash U web site.
■ The data are being widely used by the research community. More
than 100 papers cited the Wash U grant at the point where data
collection was only half complete.
■ High visibility papers have appeared. Researchers from outside the
WU-Minn collaboration have authored some of those papers.
Phase 1 Connectome Accomplishments
• Based on the results from the original connectome project (which
collected data on 22-34 year old healthy subjects), NIH decided to fund
awards for a lifespan connectome. Three awards have been made that
will cover the age range from birth to the oldest old (90+).
• In addition, NIH has funded 14 awards to measure connectomics on
groups that have some sort of diagnosis (Alzheimer's disease, low vision,
dementia, epilepsy, mood and anxiety disorders, psychosis, …).
• Over 8,000 subjects are participating and nearly 12,000 scans are
expected in the data infrastructure by 2021. Phenotypic and clinical
assessments as well as other non-MRI data are being collected and shared.
• In addition, the Adolescent Brain Cognitive Development (ABCD) study
has chosen to use the connectome data collection protocol. That study
intends to enroll 10,000 children aged 9-10 and follow them into early
adulthood. This dataset requires a data access agreement.
• A Connectome Coordination Facility has been created to hold all of these data.
• The original HCP consents allowed almost unlimited access to the data
(clinical and phenotypic data as well as the MRI and MEG data).
• An individual who wants data simply enters a working web site into the
registration system and certifies that they will not attempt to re-identify
any of the research participants.
• Many of the original participants are part of the Missouri twin study. This
caused some of the measured data (family structure, substance use) to
be declared sensitive. The sensitive data had a more restrictive data access process.
Moving to NIMH Data Archive (NDA)
• Open access clearly resulted in a lot of data use – transfers, papers, …
• Open access probably helped the community to adopt HCP data
collection as the current standard
• Even in this open access data set, there is still some information that is
sensitive and requires approval. When the data were at Wash U, the
only penalty for misusing the data was loss of further access to the data.
• No penalties were ever imposed for mistreating data.
• ABCD early data availability (needs DAC) seems to be around the same
level as HCP – does this mean that researchers will do what it takes to
get good data?
• Probably the key question to think about when deciding between open
access and a more restrictive model is what penalties need to be
imposed if the data are not treated in accord with the data access agreement.
HCP – Open Access
The “WU-Minn” HCP consortium of the initial HCP dataset
• Understanding complex diseases needs lots of different data
from a variety of sources.
• Informed consents, national laws concerning data sharing,
and investigator preferences can all restrict the aggregation of data.
• All of those issues can be solved, with some effort.
• If you have an option, the decision whether to share data under a
very open access model or in a federal database should be based on
what needs to happen if the data are not treated appropriately.
• Even though it is easier to get data from an open repository,
early results from the ABCD project suggest that users will
take the steps needed to get access to high quality data.