Data Governance in Two Different Data Archives: When
is a Federal Data Repository Useful?
Greg Farber
Director, Office of Technology Development and Coordination
National Institute of Mental Health
National Institutes of Health
March 2018
1) Most research subjects want their data to be used to understand
disease broadly. They are not too concerned about how researchers
use their data.
2) The diseases we are trying to understand today are complex, meaning
that the same symptoms can have many different underlying biological
causes. Except in the cases where a deeply penetrant point mutation
uncovers a single biological pathway to a disease, understanding the
“subgroups” for complex diseases requires data from large populations
who have similar symptoms.
3) Differences in data sharing laws in different countries make it difficult
or impossible to move data across international borders. Federating
data archives that are storing data in a similar way provides an
inelegant but workable solution to this problem.
4) Despite the urgent need to aggregate data to understand complex
diseases, individual consents and local laws must be respected.
Guiding Ideas
Policy Considerations can be Manipulated to Become
an Excuse Not to Share Data
• Contrast two data archives that have built the infrastructure necessary
to aggregate data on complex diseases.
• NIMH Data Archive (NDA) Overview
▪ Federal Data Repository where the data are owned by the US National
Institutes of Health
▪ Infrastructure
▪ Policy Issues
• Human Connectome Program (HCP)
▪ Large NIH funded project
▪ Access to most data was by self certification
▪ Initial Data Distribution was through Washington University
Roadmap
• Stores data from experiments involving human subjects that are
deposited by research laboratories.
▪ Federal data repository
▪ Originally contained data from human subjects related to mental illness
(and control subjects), but that has expanded in a number of ways over the
past 12 months. Most subjects have consented to broad data sharing.
▪ Data are available to the research community through a not too difficult
application process.
▪ Both submission and access to subject level data require approval of an
institutional official.
▪ Summary data are available to everyone with a browser
(https://data-archive.nimh.nih.gov/)
• Begun in late 2006; the first data were received in 2008
• The data types include demographic data, clinical assessments,
imaging data, and –omic data. There are no formal limits to the types
of data that can be stored in NDA.
NIMH Data Archive
• The NDA currently makes data available to the research community
from 200,000 subjects. Additional data are held by the NDA but are
not yet ready for sharing because the grant is still active and/or no
papers have yet been published.
• Many subjects have longitudinal data.
• ~1.1 PB of imaging and –omic data is securely stored in the Amazon
cloud.
• Currently, the NDA does not contain any personally identifiable
information, but we expect to begin holding such data in the near
future (data from mobile devices).
▪ This change will likely require that NDA verify that the use of the data has
been approved by an Institutional Review Board.
NIMH Data Archive – Current Size and Scope
• It is best to think of NDA as a large (~182,000 data elements by
~200,000 people), sparse, two dimensional matrix.
NDA Structure – Rows and Columns are the Building
Blocks
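The sparse-matrix picture on this slide can be made concrete with a small sketch. The one below stores only the cells that actually hold a value, keyed by (GUID, data element). The GUIDs, element names, and values are invented for illustration, and a dictionary of cells is just one convenient representation, not a description of how NDA stores its data.

```python
# Hypothetical sketch: a sparse subject-by-element matrix stored as a dict.
# Keys are (guid, data_element) pairs; an absent key means "not measured".

def density(cells, n_subjects, n_elements):
    """Fraction of the full matrix that actually holds a value."""
    return len(cells) / (n_subjects * n_elements)

def row(cells, guid):
    """All data elements recorded for one subject."""
    return {element: value for (g, element), value in cells.items() if g == guid}

# Made-up GUIDs, element names, and values, for illustration only.
cells = {
    ("GUID_A", "age_months"): 312,
    ("GUID_A", "iq_total"): 104,
    ("GUID_B", "age_months"): 298,
}

# Size of the full matrix using the approximate figures quoted on the slide.
full_cells = 182_000 * 200_000  # about 3.64e10 possible cells

print(density(cells, n_subjects=2, n_elements=3))  # 3 of 6 cells are filled
print(row(cells, "GUID_A"))
```

Because most subjects are measured on only a small fraction of the data elements, storing the filled cells alone keeps the structure far smaller than the roughly 3.64e10 cells of the full matrix.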
• The NDA data dictionary is one of the key building blocks for this
repository. It provides a flexible and extensible framework for data
definition by the research community.
• 2,000+ instruments, freely available to anyone
▪ 180,000+ unique data elements and growing
▪ Data dictionaries describing
• Clinical
• Genomics/Proteomics
• MRI Modalities
• Other complex data (EEG, eye tracking)
• Accommodates any data type and data structure
• Describes the data collected by the research community
Data Dictionary – The First Building Block
• Curated by NDA (this takes a lot of time)
• Data held in different archives needs to use common data
dictionaries to allow deep federation.
• The associated validation tool allows investigators to
quickly perform quality control tests of their data without
submitting data anywhere.
• Data in archives that don’t have a similar QC step are
likely to have issues.
• Both to enhance the quality of the science and to ensure
that the time and effort that research subjects are
spending in our research protocols are not wasted, the
validation tool should be run frequently (daily, weekly).
This is common practice in many other domains.
Data Dictionary – The First Building Block
• The NDA GUID software allows any researcher
to generate a unique identifier using some
information from a birth certificate.
• If the same information is entered in different
laboratories, the same GUID will be generated.
• This strategy allows NDA to aggregate data on
the same subject collected in multiple
laboratories without holding any of the personally
identifiable information about that subject.
• The GUID is now being discussed in a number of
additional research communities. We think we
have a reasonable plan to prevent a GUID from
becoming something like a social security
number (which would be identifying in itself)
• External studies indicate that the GUID
implementation is pretty robust both to false
positives and false negatives in large
populations.
Global Unique Identifier – the Other Building Block
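The property described on this slide, that the same inputs entered in different laboratories yield the same identifier, can be illustrated with a short sketch. This is not the NDA GUID algorithm, which the slides do not describe. The choice of fields, the normalization rules, and the use of SHA-256 are all assumptions made here for illustration.

```python
import hashlib
import unicodedata

def normalize(text):
    """Strip accents, case, and surrounding whitespace so labs agree on input."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.strip().upper()

def pseudo_identifier(first_name, last_name, birth_date, birth_city):
    """Deterministic identifier from birth-certificate-style fields.

    Hypothetical: the field list and the hash are illustrative assumptions,
    not the fields or method used by the real GUID software.
    """
    fields = [normalize(f) for f in (first_name, last_name, birth_date, birth_city)]
    digest = hashlib.sha256("|".join(fields).encode("ascii")).hexdigest()
    return digest[:16]

# Two laboratories type the same person slightly differently.
lab_1 = pseudo_identifier("José", "Alvarez", "1990-04-12", "Baltimore")
lab_2 = pseudo_identifier(" jose", "ALVAREZ ", "1990-04-12", "baltimore")
print(lab_1 == lab_2)  # True
```

A bare hash of a few low-entropy fields could be reversed by enumerating plausible inputs, so a production system would need additional protections beyond what this sketch shows.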
Federation – The GUID Does Work
At this point, data has been received from the laboratory
that measured the data. Each subject has a GUID or a
pseudo-GUID. A data dictionary has been defined,
and the submitted data have been validated against that
definition.
How does an outside user find data they are interested
in?
An Example of Data Associated with a Particular Laboratory
Now assigning DOIs to each study, and we can track how often a DOI link is
clicked (the start of a data citation)
Results in 750 subjects being discovered
• Assertion: Any consent language that restricts the use of the data
for particular purposes (for autism research…) results in profound
negative consequences.
• For example, if a researcher is trying to aggregate data between
subjects with schizophrenia and autism to understand common
symptoms that are observed in the two diagnostic groups, a consent that
limited a dataset for use only to understand one of those diagnostic
conditions would probably mean the data is not accessible for a
comparison study.
• Restricted data are also probably off limits for those who are trying to
use data mining techniques to develop or substantiate a hypothesis.
• There are some cases where restrictive consents might be appropriate,
but this should be the rare exception.
Policy – Consents
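The effect of a restrictive consent on a cross-diagnosis comparison can be shown with a small sketch. The dataset records and the `allowed_uses` field below are hypothetical and do not reflect how NDA actually encodes consent terms.

```python
def usable_for(dataset, purposes):
    """A dataset is usable only if its consent covers every stated purpose."""
    allowed = dataset["allowed_uses"]  # None stands for broad consent
    return allowed is None or set(purposes) <= allowed

# Made-up datasets for illustration.
datasets = [
    {"name": "autism_cohort", "allowed_uses": {"autism"}},
    {"name": "schizophrenia_cohort", "allowed_uses": None},
]

# A study comparing symptoms across both diagnostic groups.
comparison = ["autism", "schizophrenia"]
available = [d["name"] for d in datasets if usable_for(d, comparison)]
print(available)  # ['schizophrenia_cohort']
```

Under these assumptions the autism-only dataset drops out of the comparison study even though it is directly relevant, which is the negative consequence the slide asserts.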
• NIMH expects that
research we pay for
involving human subjects
will result in that data
being made available in
NDA.
• Journals can also have a
positive role to play in
requiring that data be
placed in a repository
prior to publication.
• Asking for data volunteers
probably isn’t good
enough right now.
Policy – Data Deposition
• Summary data are available to anyone via the web site, but
accessing subject level data requires a data access form.
• Similarly, a data submission agreement is required that
certifies that the data were consented for sharing.
• Both forms require the signature from the PI and an
institutional official. This means that the research
institution is formally responsible for ensuring that the
data are “treated with respect”.
• Although neither form is complicated, they do raise barriers
to accessing the data.
Policy – NDA Data Access and Data Submission
• The NIMH Data Archive does hold some data that were
collected outside the US.
• For those datasets, the institutional official has decided that
depositing data is allowed both by the terms of the informed
consent and by the laws in that country.
• When there are restrictions to allowing data to be moved, it
is still possible to make it easy for the research community
to find data by federating data archives.
Policy – Data from Institutions Outside the US
• For NDA, submitting data is separate from sharing that data
with the research community.
• Data are shared when the grant is complete or when a
paper is published.
• Other sharing timelines are possible.
• No matter when the data are shared, data need to be
submitted on a regular basis. This ensures that the data
from a grant award has been submitted before funding is
exhausted. More importantly, periodic data submission
ensures that the data have undergone basic QC checks as
they are collected.
Policy – Timeline for Data Sharing
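The default sharing rule on this slide (share when the grant is complete or when a paper is published, whichever comes first) can be written as a small predicate. The function name and parameters are hypothetical, and the sketch ignores the "other sharing timelines" that the slide says are possible.

```python
from datetime import date

def is_shared(today, grant_end, first_publication=None):
    """Data become shareable once the grant has ended or a paper is out."""
    grant_complete = today >= grant_end
    paper_published = first_publication is not None and today >= first_publication
    return grant_complete or paper_published

# Made-up dates for illustration.
grant_end = date(2019, 6, 30)
print(is_shared(date(2018, 3, 1), grant_end))                     # False
print(is_shared(date(2018, 3, 1), grant_end, date(2018, 1, 15)))  # True
print(is_shared(date(2019, 7, 1), grant_end))                     # True
```

Note that this rule concerns sharing only; submission still happens on a regular schedule regardless of when the data are released.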
• Responding to a number of instances of high visibility/impact
experiments that were not thoughtfully designed, NIH (and
NIMH) have instituted a number of programs to enhance
rigor and reproducibility in research supported by NIH.
• These discussions with the community started in June 2012.
The new guidelines to increase rigor and reproducibility are
outlined in NOT-OD-15-103 and at a web site
(https://www.nih.gov/research-training/rigor-reproducibility).
• Data archives play an important role in improving the rigor
and reproducibility of NIMH funded research.
Rigor and Reproducibility – Data Archives Help
• Data dictionaries are a key part of the NDA infrastructure. Each item in a data
dictionary has an allowable range of values. The NDA has a validation tool that
allows users to check a data set to see if it conforms with the allowable ranges and
formats in a data dictionary.
• Because of our mandated data deposition schedule the validation tool allows labs
to find errors every 6 months when data are deposited (or more often if they
choose).
Rigor and Reproducibility - 2
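The kind of check described on this slide can be sketched as follows. This is not the NDA validation tool. The dictionary entries, element names, and ranges are invented, and the sketch covers only type and range checks.

```python
# Hypothetical data dictionary: element name -> (type, minimum, maximum).
DICTIONARY = {
    "age_months": (int, 0, 1440),
    "iq_total": (int, 40, 160),
}

def validate(record, dictionary=DICTIONARY):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for name, value in record.items():
        if name not in dictionary:
            problems.append(f"{name}: not defined in the dictionary")
            continue
        expected_type, low, high = dictionary[name]
        if not isinstance(value, expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems

print(validate({"age_months": 312, "iq_total": 104}))  # no problems
print(validate({"age_months": -5, "height_cm": 170}))  # two problems
```

Running a check like this locally on every batch of new records, rather than only at the six-month deposition, is what lets a lab catch entry errors while they can still be corrected.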
• The NDA makes it easy to identify the data associated with a publication, and we
assign a DOI to that dataset to make it trivial for the research community to find the
data.
• Identifying the data from a publication allows researchers to look at all of the data
collected under an award and compare that to the data used in the publication.
Rigor and Reproducibility - 3
• There are “professional” research participants who seem to make a
living volunteering for clinical studies. Websites exist that make it
relatively easy for such participants to find out the right answers to
screening questions to be admitted to a study. Clearly, this can be
dangerous to the volunteer and can also put the rigor and reproducibility
of the study at risk.
• Recruitment in certain diagnostic categories may take place in a small
number of clinical centers. This means that papers from many different
research groups may be sampling from a smaller population than the
“independent” papers might suggest.
• The NDA GUID helps the research community understand the size of
these problems and deal with these issues.
• There are also commercial services that aid in the screening for
someone who is participating in multiple clinical trials.
GUIDs and Rigor/Reproducibility
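One way a shared identifier helps size these problems is by counting how many distinct people sit behind a set of enrollments. The sketch below assumes each enrollment is a (study, GUID) pair. The study names and GUIDs are invented for illustration.

```python
from collections import defaultdict

def studies_by_guid(enrollments):
    """Map each GUID to the set of studies in which it appears."""
    mapping = defaultdict(set)
    for study, guid in enrollments:
        mapping[guid].add(study)
    return mapping

def overlap_report(enrollments):
    """Summarize how many enrollments come from repeat participants."""
    mapping = studies_by_guid(enrollments)
    repeated = {g: sorted(s) for g, s in mapping.items() if len(s) > 1}
    return {
        "enrollments": len(enrollments),
        "unique_subjects": len(mapping),
        "repeated": repeated,
    }

# Made-up enrollments for illustration.
enrollments = [
    ("study_1", "GUID_A"),
    ("study_1", "GUID_B"),
    ("study_2", "GUID_A"),
    ("study_3", "GUID_A"),
    ("study_3", "GUID_C"),
]
print(overlap_report(enrollments))
```

In this example five enrollments come from only three people, so three nominally independent studies share a participant.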
NIH/NIMH Data Archives Staff
1) NIMH Data Archive – very heterogeneous data collected in multiple
laboratories. NDA attempts to aggregate this data using a global
unique identifier system as well as data dictionaries to describe the
myriad experiments.
2) Human Connectome Program – heterogeneous data (clinical
assessments, imaging, MEG, genomics) collected using a common
protocol. The first phase of this project involved data collection from
typical research subjects at a single site. The project has recently
expanded to include data collected across the lifespan for control
subjects as well as from subjects with a diagnosis. Those datasets are
collected at multiple laboratories, but still use similar data collection
protocols.
Two Different Sorts of Data Archives
• The NIH Human Connectome Project (HCP) is supported by the NIH
Neuroscience Blueprint ICs
• The HCP is an ambitious effort to map the neural pathways that underlie
human brain function. The overarching purpose of the Project is to
acquire and share data about the structural and functional connectivity of
the human brain. It has greatly advanced the capabilities for imaging and
analyzing brain connections, resulting in improved sensitivity, resolution,
and utility, thereby accelerating progress in the emerging field of human
connectomics.
• Phase 1 of the HCP resulted in two awards
■ David Van Essen and Kamil Ugurbil, Wash U and U Minnesota
■ Bruce Rosen, MGH
Human Connectome Project
1) Deliver advanced MRI scanners and
techniques with high spatial and
temporal resolution for functional
and diffusion MRI.
■ Both the MGH and the Wash U MRIs
worked as designed and were able
to collect data quickly. Siemens
learned a great deal from
collaborating on both instruments,
and their new family of 3T MRIs (the
Prisma) has operating characteristics
similar to the Wash U scanner.
■ The supplements to port the pulse
sequences to other laboratories and
to other manufacturers have also been
successful.
Phase 1 Connectome Accomplishments
2) Deliver high quality data to the research community
■ Wash U has released data from 1200 subjects. This includes
behavioral assessments, structural MRIs, rs fMRI, task fMRI, and
diffusion experiments. MEG data has also just been released. This is
the first time that a large imaging award adopted “genome speed” data
release.
■ Data from MGH are being made available on their web site as well as
at the Wash U web site.
■ The data are being widely used by the research community. More
than 100 papers cited the Wash U grant at the point where data
collection was only half complete.
■ High visibility papers have appeared. Researchers from outside the
WU-Minn collaboration have authored some of those papers.
Phase 1 Connectome Accomplishments
• Based on the results from the original connectome project (which
collected data on 22-34 year old healthy subjects), NIH decided to fund
awards for a lifespan connectome. Three awards have been made that
will cover the age range from birth to the oldest old (90+).
• In addition, NIH has funded 14 awards to measure connectomics on
groups that have some sort of diagnosis (Alzheimer's, low vision,
dementia, epilepsy, mood and anxiety disorders, psychosis, …).
• Over 8,000 subjects are participating and nearly 12,000 scans are
expected in the data infrastructure by 2021. Phenotypic and clinical
assessments as well as other non-MRI data are being collected and
made available.
• In addition, the Adolescent Brain Cognitive Development (ABCD) study
has chosen to use the connectome data collection protocol. That study
intends to enroll 10,000 children aged 9-10 and follow them into early
adulthood. This dataset requires a data access agreement.
Connectome Today
• A Connectome Coordination Facility has been created to hold all of the
data (https://www.humanconnectome.org/).
• The original HCP consents allowed almost unlimited access to the data
(clinical and phenotypic data as well as the MRI and MEG data).
• An individual who wanted data simply enters a working web site into the
registration system and certifies that they will not attempt to re-identify
any of the research participants.
• Many of the original participants are part of the Missouri twin study. This
caused some of the measured data (family structure, substance use) to
be declared sensitive. The sensitive data had a more restrictive data
access protocol.
HCP Data
ConnectomeDB
Moving to NIMH Data Archive (NDA)
ConnectomeDB
ConnectomeDB – Widespread Data Usage
• Clearly resulted in a lot of data use – transfers, papers, …
• Open access probably helped the community to adopt HCP data
collection as the current standard
• Even in this open access data set, there is still some information that is
sensitive and requires approval. When the data were at Wash U, the
only penalty for misusing the data was loss of further access to the data.
• No penalties were ever imposed for mistreating data.
• ABCD early data availability (needs DAC) seems to be around the same
level as HCP – does this mean that researchers will do what it takes to
get good data?
• Probably the key question to think about when deciding between open
access and a more restrictive model is what penalties need to be
imposed if the data are not treated in accord with the data access
agreement.
HCP – Open Access
The “WU-Minn” HCP consortium of the initial HCP Dataset
• Understanding complex diseases needs lots of different data
from a variety of sources.
• Informed consents, national laws concerning data sharing,
and investigator preferences can all restrict the aggregation
of data.
• All of those issues can be solved, with some effort.
• If you have an option, deciding whether to share data under a
very open access model or in a federal database should be
made based on what needs to happen if the data are not
treated appropriately.
• Even though it is easier to get data from an open repository,
early results from the ABCD project suggest that users will
take the steps needed to get access to high quality data.
Summary

Data Policy for Open Science
 
dkNET Office Hours - "Are You Ready for 2023: New NIH Data Management and Sha...
dkNET Office Hours - "Are You Ready for 2023: New NIH Data Management and Sha...dkNET Office Hours - "Are You Ready for 2023: New NIH Data Management and Sha...
dkNET Office Hours - "Are You Ready for 2023: New NIH Data Management and Sha...
 
Compliance: Data Management Plans and Public Access to Data
Compliance: Data Management Plans and Public Access to DataCompliance: Data Management Plans and Public Access to Data
Compliance: Data Management Plans and Public Access to Data
 
BLC & Digital Science: Mark Hahnel, Figshare
BLC & Digital Science: Mark Hahnel, FigshareBLC & Digital Science: Mark Hahnel, Figshare
BLC & Digital Science: Mark Hahnel, Figshare
 
PhRMA Some Early Thoughts
PhRMA Some Early ThoughtsPhRMA Some Early Thoughts
PhRMA Some Early Thoughts
 

Recently uploaded

Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
NoelManyise1
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
zeex60
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
ronaldlakony0
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
frank0071
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 

Recently uploaded (20)

Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 

Data Governance in two different data archives: When is a federal data repository useful

  • 1. Data Governance in Two Different Data Archives: When is a Federal Data Repository Useful? Greg Farber Director, Office of Technology Development and Coordination National Institute of Mental Health National Institutes of Health March 2018
  • 2. 1) Most research subjects want their data to be used to understand disease broadly. They are not too concerned about how researchers use their data. 2) The diseases we are trying to understand today are complex, meaning that the same symptoms can have many different underlying biological causes. Except in the cases where a deeply penetrant point mutation uncovers a single biological pathway to a disease, understanding the “subgroups” for complex diseases requires data from large populations who have similar symptoms. 3) Differences in data sharing laws in different countries make it difficult or impossible to move data across international borders. Federating data archives that are storing data in a similar way provides an inelegant but workable solution to this problem. 4) Despite the urgent need to aggregate data to understand complex diseases, individual consents and local laws must be respected. Guiding Ideas 2
  • 3. Policy Considerations can be Manipulated to Become an Excuse Not to Share Data 3
  • 4. • Contrast two data archives that have built the infrastructure necessary to aggregate data on complex diseases. • NIMH Data Archive (NDA) Overview ▪ Federal Data Repository where the data are owned by the US National Institutes of Health ▪ Infrastructure ▪ Policy Issues • Human Connectome Program (HCP) ▪ Large NIH funded project ▪ Access to most data was by self certification ▪ Initial Data Distribution was through Washington University Roadmap
  • 5. • Stores data from experiments involving human subjects that are deposited by research laboratories. ▪ Federal data repository ▪ Originally contained data from human subjects related to mental illness (and control subjects), but that has expanded in a number of ways over the past 12 months. Most subjects have consented to broad data sharing. ▪ Data are available to the research community through a not too difficult application process. ▪ Both submission and access to subject level data require approval of an institutional official. ▪ Summary data are available to everyone with a browser (https://data-archive.nimh.nih.gov/) • Begun in late 2006, with the first data received in 2008 • The data types include demographic data, clinical assessments, imaging data, and –omic data. There are no formal limits to the types of data that can be stored in NDA. NIMH Data Archive
  • 6.
  • 7. • The NDA currently makes data available to the research community from 200,000 subjects. Additional data are held by the NDA but are not yet ready for sharing because the grant is still active and/or has not published papers. • Many subjects have longitudinal data. • ~1.1 PB of imaging and –omic data is securely stored in the Amazon cloud. • Currently, the NDA does not contain any personally identifiable information, but we expect to begin holding such data in the near future (data from mobile devices). ▪ This change will likely require that NDA verify that the use of the data has been approved by an Institutional Review Board. NIMH Data Archive – Current Size and Scope
  • 8. • It is best to think of NDA as a large (~182,000 data elements by ~200,000 people), sparse, two-dimensional matrix. NDA Structure – Rows and Columns are the Building Blocks 8
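One way to picture the sparse matrix described on this slide is a mapping that stores only the cells that were actually measured, with subjects as rows and data elements as columns. The sketch below is purely illustrative: the class name, the identifiers, and the element names are hypothetical, and it makes no claim about how NDA stores data internally.

```python
from collections import defaultdict


class SparseSubjectMatrix:
    """Rows are subjects, columns are data elements; only observed cells are kept."""

    def __init__(self):
        # subject id -> {element name -> value}
        self._rows = defaultdict(dict)

    def set(self, subject, element, value):
        self._rows[subject][element] = value

    def get(self, subject, element, default=None):
        # .get() on the outer mapping avoids creating empty rows on lookup.
        return self._rows.get(subject, {}).get(element, default)

    def subjects_with(self, element):
        """Return the subjects that have a value for the given element."""
        return sorted(s for s, cells in self._rows.items() if element in cells)

    def density(self, n_elements):
        """Fraction of cells filled, given the total number of columns."""
        filled = sum(len(cells) for cells in self._rows.values())
        total = len(self._rows) * n_elements
        return filled / total if total else 0.0


# Hypothetical subjects and elements, for illustration only.
m = SparseSubjectMatrix()
m.set("DEMO_A", "age_months", 120)
m.set("DEMO_A", "iq_score", 101)
m.set("DEMO_B", "age_months", 96)

print(m.subjects_with("iq_score"))
print(m.density(n_elements=4))
```

Because most subjects have values for only a small fraction of the data elements, storing observed cells alone keeps the structure compact while still allowing column-wise queries such as "which subjects have this measure".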
  • 9. • The NDA data dictionary is one of the key building blocks for this repository. It provides a flexible and extensible framework for data definition by the research community. • 2,000+ instruments, freely available to anyone ▪ 180,000+ unique data elements and growing ▪ Data dictionaries describing • Clinical • Genomics/Proteomics • MRI Modalities • Other complex data (EEG, eye tracking) • Accommodates any data type and data structure • Describes the data collected by the research community Data Dictionary – The First Building Block
  • 10. • Curated by NDA (this takes a lot of time) • Data held in different archives need to use common data dictionaries to allow deep federation. • The associated validation tool allows investigators to quickly perform quality control tests of their data without submitting data anywhere. • Data in archives that don’t have a similar QC step are likely to have issues. • Both to enhance the quality of the science and to ensure that the time and effort that research subjects are spending in our research protocols are not wasted, the validation tool should be run frequently (daily, weekly). This is common practice in many other domains. Data Dictionary – The First Building Block
  • 11.
  • 12.
  • 13.
  • 14. • The NDA GUID software allows any researcher to generate a unique identifier using some information from a birth certificate. • If the same information is entered in different laboratories, the same GUID will be generated. • This strategy allows NDA to aggregate data on the same subject collected in multiple laboratories without holding any of the personally identifiable information about that subject. • The GUID is now being discussed in a number of additional research communities. We think we have a reasonable plan to prevent a GUID from becoming something like a social security number (which would be identifying in itself). • External studies indicate that the GUID implementation is pretty robust both to false positives and false negatives in large populations. Global Unique Identifier – the Other Building Block
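The property described on this slide, that the same inputs entered in different laboratories yield the same identifier, can be illustrated with a deterministic hash over normalized fields. This is a minimal sketch under stated assumptions and not the NDA GUID algorithm, which the slides do not describe: the function names, the choice of fields, the `DEMO_` prefix, and the use of SHA-256 are all hypothetical. An unkeyed hash of personal details can be guessed by brute force, so a real system would need additional protections.

```python
import hashlib
import unicodedata


def normalize(value: str) -> str:
    """Strip accents, collapse whitespace, and uppercase, so that
    trivial differences in data entry do not change the identifier."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.upper().split())


def pseudo_guid(first_name: str, last_name: str,
                birth_date: str, birth_city: str) -> str:
    """Derive a stable identifier from (hypothetical) birth-certificate fields."""
    fields = [normalize(f) for f in (first_name, last_name, birth_date, birth_city)]
    digest = hashlib.sha256("|".join(fields).encode("utf-8")).hexdigest()
    return "DEMO_" + digest[:16].upper()


# Two laboratories enter the same person with different spacing, case, and accents.
lab_a = pseudo_guid("José", "Álvarez", "2009-04-17", "San Juan")
lab_b = pseudo_guid("  jose ", "ALVAREZ", "2009-04-17", "san  juan")
# A different birth date gives a different identifier.
other = pseudo_guid("Jose", "Alvarez", "2009-04-18", "San Juan")

print(lab_a == lab_b)
print(lab_a == other)
```

Only the derived identifier would leave the laboratory in a scheme like this, which is what lets an archive link records on the same subject without holding the underlying personal information.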
  • 15. Federation – The GUID Does Work
  • 16. At this point, data has been received from the laboratory that measured the data. Each subject has a GUID or a pseudo-GUID. A data dictionary has been defined, and the submitted data have been validated against that definition. How does an outside user find data they are interested in?
  • 17.
  • 18. An Example of Data Associated with a Particular Laboratory
  • 19.
  • 20. Now assigning DOIs to each study, and we can track how often a DOI link is clicked (the start of a data citation)
  • 21.
  • 22. Results in 750 subjects being discovered
  • 23. • Assertion: Any consent language that restricts the use of the data for particular purposes (for autism research…) results in profound negative consequences. • For example, if a researcher is trying to aggregate data between subjects with schizophrenia and autism to understand common symptoms that are observed in the two diagnostic groups, a consent that limited a dataset for use only to understand one of those diagnostic conditions would probably mean the data is not accessible for a comparison study. • Restricted data are also probably off limits for those who are trying to use data mining techniques to develop or substantiate a hypothesis. • There are some cases where restrictive consents might be appropriate, but this should be the rare exception. Policy – Consents 23
  • 24. • NIMH expects that research we pay for involving human subjects will result in that data being made available in NDA. • Journals can also have a positive role to play in requiring that data be placed in a repository prior to publication. • Asking for data volunteers probably isn’t good enough right now. Policy – Data Deposition 24
  • 25. • Summary data are available to anyone via the web site, but accessing subject level data requires a data access form. • Similarly, a data submission agreement is required that certifies that the data were consented for sharing. • Both forms require the signature from the PI and an institutional official. This means that the research institution is formally responsible for ensuring that the data are “treated with respect”. • Although neither form is complicated, they do raise barriers to accessing the data. Policy – NDA Data Access and Data Submission 25
  • 26. • The NIMH Data Archive does hold some data that were collected outside the US. • For those datasets, the institutional official has decided that depositing data is allowed both by the terms of the informed consent and by the laws in that country. • When there are restrictions to allowing data to be moved, it is still possible to make it easy for the research community to find data by federating data archives. Policy – Data from Institutions Outside the US 26
  • 27. • For NDA, submitting data is separate from sharing that data with the research community. • Data are shared when the grant is complete or when a paper is published. • Other sharing timelines are possible. • No matter when the data are shared, data need to be submitted on a regular basis. This ensures that the data from a grant award has been submitted before funding is exhausted. More importantly, periodic data submission ensures that the data have undergone basic QC checks as they are collected. Policy – Timeline for Data Sharing 27
  • 28. • Responding to a number of instances of high visibility/impact experiments that were not thoughtfully designed, NIH (and NIMH) have instituted a number of programs to enhance rigor and reproducibility in research supported by NIH. • These discussions with the community started in June 2012. The new guidelines to increase rigor and reproducibility are outlined in NOT-OD-15-103 and at a web site (https://www.nih.gov/research-training/rigor-reproducibility). • Data archives play an important role in improving the rigor and reproducibility of NIMH funded research. Rigor and Reproducibility – Data Archives Help 28
  • 29. • Data dictionaries are a key part of the NDA infrastructure. Each item in a data dictionary has an allowable range of values. The NDA has a validation tool that allows users to check a data set to see if it conforms with the allowable ranges and formats in a data dictionary. • Because of our mandated data deposition schedule, the validation tool allows labs to find errors every 6 months when data are deposited (or more often if they choose). Rigor and Reproducibility - 2 29
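The kind of check described on this slide, comparing each value in a record against the allowable type and range in a data dictionary, might look roughly like the following. It is a sketch only and does not reproduce the NDA validation tool: `ElementDef`, `validate`, and the element names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementDef:
    """One data dictionary entry (hypothetical structure)."""
    name: str
    dtype: type
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    required: bool = False


def validate(record: dict, dictionary: list) -> list:
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = []
    known = {d.name for d in dictionary}
    for d in dictionary:
        value = record.get(d.name)
        if value is None:
            if d.required:
                errors.append(f"{d.name}: missing required value")
            continue
        if not isinstance(value, d.dtype):
            errors.append(
                f"{d.name}: expected {d.dtype.__name__}, got {type(value).__name__}"
            )
            continue
        if d.minimum is not None and value < d.minimum:
            errors.append(f"{d.name}: {value} below minimum {d.minimum}")
        if d.maximum is not None and value > d.maximum:
            errors.append(f"{d.name}: {value} above maximum {d.maximum}")
    # Flag columns that the dictionary does not define.
    for key in record:
        if key not in known:
            errors.append(f"{key}: not defined in the data dictionary")
    return errors


# Hypothetical dictionary and records, for illustration only.
dictionary = [
    ElementDef("subject_id", str, required=True),
    ElementDef("age_months", int, minimum=0, maximum=1200),
    ElementDef("score", float, minimum=0.0, maximum=100.0),
]

good = {"subject_id": "DEMO_1", "age_months": 130, "score": 87.5}
bad = {"subject_id": "DEMO_2", "age_months": 2000, "extra": 1}

print(validate(good, dictionary))
for problem in validate(bad, dictionary):
    print(problem)
```

Running a check like this locally, before anything is submitted, is what allows errors to be caught while data collection is still under way rather than at the end of an award.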
  • 30. • The NDA makes it easy to identify the data associated with a publication, and we assign a DOI to that dataset to make it trivial for the research community to find the data. • Identifying the data from a publication allows researchers to look at all of the data collected under an award and compare that to the data used in the publication. Rigor and Reproducibility - 3 30
  • 31. • There are “professional” research participants who seem to make a living volunteering for clinical studies. Websites exist that make it relatively easy for such participants to find out the right answers to screening questions to be admitted to a study. Clearly, this can be dangerous to the volunteer and can also put the rigor and reproducibility of the study at risk. • Recruitment in certain diagnostic categories may take place in a small number of clinical centers. This means that papers from many different research groups may be sampling from a smaller population than the “independent” papers might suggest. • The NDA GUID helps the research community understand the size of these problems and deal with these issues. • There are also commercial services that aid in the screening for someone who is participating in multiple clinical trials. GUIDs and Rigor/Reproducibility 31
  • 33. 1) NIMH Data Archive – very heterogeneous data collected in multiple laboratories. NDA attempts to aggregate this data using a global unique identifier system as well as data dictionaries to describe the myriad experiments. 2) Human Connectome Program – heterogeneous data (clinical assessments, imaging, MEG, genomics) collected using a common protocol. The first phase of this project involved data collection from typical research subjects at a single site. The project has recently expanded to include data collected across the lifespan for control subjects as well as from subjects with a diagnosis. Those datasets are collected at multiple laboratories, but still use similar data collection protocols. Two Different Sorts of Data Archives
  • 34. • The NIH Human Connectome Project (HCP) is supported by the NIH Neuroscience Blueprint ICs • The HCP is an ambitious effort to map the neural pathways that underlie human brain function. The overarching purpose of the Project is to acquire and share data about the structural and functional connectivity of the human brain. It has greatly advanced the capabilities for imaging and analyzing brain connections, resulting in improved sensitivity, resolution, and utility, thereby accelerating progress in the emerging field of human connectomics. • Phase 1 of the HCP resulted in two awards ■ David Van Essen and Kamil Ugurbil, Wash U and U Minnesota ■ Bruce Rosen, MGH Human Connectome Project 34
  • 35. 1) Deliver advanced MRI scanners and techniques with high spatial and temporal resolution for functional and diffusion MRI. ■ Both the MGH and the Wash U MRIs worked as designed and were able to collect data quickly. Siemens learned a great deal from collaborating on both instruments, and their new family of 3T MRIs (the Prisma) has operating characteristics similar to the Wash U scanner. ■ The supplements to port the pulse sequences to other laboratories and to other manufacturers have also been successful. Phase 1 Connectome Accomplishments 35
  • 36. 2) Deliver high quality data to the research community ■ Wash U has released data from 1200 subjects. This includes behavioral assessments, structural MRIs, rs fMRI, task fMRI, and diffusion experiments. MEG data has also just been released. This is the first time that a large imaging award adopted “genome speed” data release. ■ Data from MGH are being made available on their web site as well as at the Wash U web site. ■ The data are being widely used by the research community. More than 100 papers cited the Wash U grant at the point where data collection was only half complete. ■ High visibility papers have appeared. Researchers from outside the WU-Minn collaboration have authored some of those papers. Phase 1 Connectome Accomplishments 36
  • 37. • Based on the results from the original connectome project (which collected data on 22-34 year old healthy subjects), NIH decided to fund awards for a lifespan connectome. Three awards have been made that will cover the age range from birth to the oldest old (90+). • In addition, NIH has funded 14 awards to measure connectomics on groups that have some sort of diagnosis (Alzheimer's, low vision, dementia, epilepsy, mood and anxiety disorders, psychosis, …). • Over 8,000 subjects are participating and nearly 12,000 scans are expected in the data infrastructure by 2021. Phenotypic and clinical assessments as well as other non-MRI data are being collected and made available. • In addition, the Adolescent Brain Cognitive Development (ABCD) study has chosen to use the connectome data collection protocol. That study intends to enroll 10,000 children aged 9-10 and follow them into early adulthood. This dataset requires a data access agreement. Connectome Today 37
  • 38. • A Connectome Coordination Facility has been created to hold all of the data (https://www.humanconnectome.org/). • The original HCP consents allowed almost unlimited access to the data (clinical and phenotypic data as well as the MRI and MEG data). • An individual who wanted data simply enters a working web site into the registration system and certifies that they will not attempt to re-identify any of the research participants. • Many of the original participants are part of the Missouri twin study. This caused some of the measured data (family structure, substance use) to be declared sensitive. The sensitive data had a more restrictive data access protocol. HCP Data 38
  • 39. ConnectomeDB Moving to NIMH Data Archive (NDA)
  • 42. • Clearly resulted in a lot of data use – transfers, papers, … • Open access probably helped the community to adopt HCP data collection as the current standard • Even in this open access data set, there is still some information that is sensitive and requires approval. When the data were at Wash U, the only penalty for misusing the data was loss of further access to the data. • No penalties were ever imposed for mistreating data. • ABCD early data availability (needs DAC) seems to be around the same level as HCP – does this mean that researchers will do what it takes to get good data? • Probably the key question to think about when deciding between open access and a more restrictive model is what penalties need to be imposed if the data are not treated in accord with the data access agreement. HCP – Open Access 42
  • 43. The “WU-Minn” HCP consortium of the initial HCP Dataset
  • 44. • Understanding complex diseases needs lots of different data from a variety of sources. • Informed consents, national laws concerning data sharing, and investigator preferences can all restrict the aggregation of data. • All of those issues can be solved, with some effort. • If you have an option, deciding whether to share data under a very open access model or in a federal database should be made based on what needs to happen if the data are not treated appropriately. • Even though it is easier to get data from an open repository, early results from the ABCD project suggest that users will take the steps needed to get access to high quality data. Summary