
Data Governance in Two Different Data Archives: When is a Federal Data Repository Useful?


Talk from Dr Greg Farber at the Human Brain Project conference 2018 on data governance in international neuro-ICT collaborations


  1. Data Governance in Two Different Data Archives: When is a Federal Data Repository Useful? Greg Farber, Director, Office of Technology Development and Coordination, National Institute of Mental Health, National Institutes of Health. March 2018
  2. Guiding Ideas: 1) Most research subjects want their data to be used to understand disease broadly. They are not too concerned about how researchers use their data. 2) The diseases we are trying to understand today are complex, meaning that the same symptoms can have many different underlying biological causes. Except in cases where a deeply penetrant point mutation uncovers a single biological pathway to a disease, understanding the “subgroups” for complex diseases requires data from large populations who have similar symptoms. 3) Differences in data sharing laws in different countries make it difficult or impossible to move data across international borders. Federating data archives that store data in a similar way provides an inelegant but workable solution to this problem. 4) Despite the urgent need to aggregate data to understand complex diseases, individual consents and local laws must be respected.
  3. Policy Considerations Can Be Manipulated to Become an Excuse Not to Share Data
  4. Roadmap: • Contrast two data archives that have built the infrastructure necessary to aggregate data on complex diseases. • NIMH Data Archive (NDA) overview ▪ Federal data repository where the data are owned by the US National Institutes of Health ▪ Infrastructure ▪ Policy issues • Human Connectome Program (HCP) ▪ Large NIH-funded project ▪ Access to most data was by self-certification ▪ Initial data distribution was through Washington University
  5. NIMH Data Archive: • Stores data from experiments involving human subjects that are deposited by research laboratories. ▪ Federal data repository ▪ Originally contained data from human subjects related to mental illness (and control subjects), but that has expanded in a number of ways over the past 12 months. Most subjects have consented to broad data sharing. ▪ Data are available to the research community through a not-too-difficult application process. ▪ Both submission and access to subject-level data require approval of an institutional official. ▪ Summary data are available to everyone with a browser (https://data- • Begun in late 2006; the first data were received in 2008. • The data types include demographic data, clinical assessments, imaging data, and -omic data. There are no formal limits on the types of data that can be stored in NDA.
  6. NIMH Data Archive – Current Size and Scope: • The NDA currently makes data available to the research community from 200,000 subjects. Additional data are held by the NDA but are not yet ready for sharing because the grant is still active and/or has not yet published papers. • Many subjects have longitudinal data. • ~1.1 PB of imaging and -omic data are securely stored in the Amazon cloud. • Currently, the NDA does not contain any personally identifiable information, but we expect to begin holding such data in the near future (data from mobile devices). ▪ This change will likely require that NDA verify that the use of the data has been approved by an Institutional Review Board.
  7. NDA Structure – Rows and Columns are the Building Blocks: • It is best to think of NDA as a large (~182,000 data elements by ~200,000 people), sparse, two-dimensional matrix.
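The sparse-matrix picture above can be sketched in a few lines. This is an illustrative model only, not the actual NDA implementation; the GUIDs and data-element names below are invented for the example. Only measured cells take storage, which is the point of a sparse layout.

```python
# Sketch: an NDA-style sparse two-dimensional matrix stored as a dictionary
# keyed by (subject GUID, data-element name). Absent keys mean "never measured".
sparse_matrix = {}

def record(guid, element, value):
    """Store one measured value for one subject."""
    sparse_matrix[(guid, element)] = value

# Hypothetical entries -- identifiers and element names are invented.
record("NDAR_INV00000001", "interview_age", 312)   # age in months
record("NDAR_INV00000001", "phq9_total", 7)        # a clinical score
record("NDAR_INV00000002", "interview_age", 288)

# Only 3 of the ~182,000 x ~200,000 possible cells are actually stored.
print(len(sparse_matrix))                                      # 3
print(sparse_matrix.get(("NDAR_INV00000002", "phq9_total")))   # None: not measured
```

A real archive would back this with a database rather than an in-memory dictionary, but the access pattern (look up a subject/element pair, tolerate absence) is the same.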
  8. Data Dictionary – The First Building Block: • The NDA data dictionary is one of the key building blocks for this repository. It provides a flexible and extensible framework for data definition by the research community. • 2,000+ instruments, freely available to anyone ▪ 180,000+ unique data elements and growing ▪ Data dictionaries describing: clinical data; genomics/proteomics; MRI modalities; other complex data (EEG, eye tracking) • Accommodates any data type and data structure • Describes the data collected by the research community
  9. Data Dictionary – The First Building Block (continued): • Curated by NDA (this takes a lot of time). • Data held in different archives need to use common data dictionaries to allow deep federation. • The associated validation tool allows investigators to quickly perform quality-control tests of their data without submitting data anywhere. • Data in archives that don’t have a similar QC step are likely to have issues. • Both to enhance the quality of the science and to ensure that the time and effort research subjects spend in our research protocols are not wasted, the validation tool should be run frequently (daily or weekly). This is common practice in many other domains.
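The kind of range-and-format check the validation tool performs can be sketched as follows. This is a minimal illustration under assumed rules, not the NDA tool itself; the field names, allowed values, and ranges are invented.

```python
# Sketch of data-dictionary validation: each element has an allowable
# type and range/value set, and a record is checked against every rule.
data_dictionary = {
    "interview_age": {"type": int, "min": 0, "max": 1440},  # age in months (invented range)
    "sex":           {"type": str, "allowed": {"M", "F"}},
}

def validate(record):
    """Return a list of QC errors for one submitted record."""
    errors = []
    for field, rule in data_dictionary.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not isinstance(value, rule["type"]):
            errors.append(f"{field}: wrong type")
        elif "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{field}: value {value!r} not allowed")
        elif "min" in rule and not (rule["min"] <= value <= rule["max"]):
            errors.append(f"{field}: {value} outside [{rule['min']}, {rule['max']}]")
    return errors

print(validate({"interview_age": 312, "sex": "F"}))   # []
print(validate({"interview_age": 9999, "sex": "X"}))  # two errors reported
```

Because a check like this needs no network access, a lab can run it daily or weekly on freshly collected data, which is exactly the practice the slide recommends.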
  10. Global Unique Identifier – The Other Building Block: • The NDA GUID software allows any researcher to generate a unique identifier using some information from a birth certificate. • If the same information is entered in different laboratories, the same GUID will be generated. • This strategy allows NDA to aggregate data on the same subject collected in multiple laboratories without holding any of the personally identifiable information about that subject. • The GUID is now being discussed in a number of additional research communities. We think we have a reasonable plan to prevent a GUID from becoming something like a social security number (which would be identifying in itself). • External studies indicate that the GUID implementation is quite robust to both false positives and false negatives in large populations.
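The core property described above, that identical birth-certificate fields entered in different labs yield an identical identifier while the archive never stores those fields, can be illustrated with a deterministic hash. To be clear, this is not the actual NDA GUID algorithm, which is more sophisticated; the function name, field set, and format below are assumptions for the sketch.

```python
# Illustrative sketch: deriving a stable identifier from normalized
# birth-certificate fields with a one-way hash, so two laboratories
# entering the same information independently obtain the same ID.
import hashlib

def make_guid(first_name, last_name, dob, city_of_birth):
    """Deterministically derive an identifier (hypothetical scheme)."""
    normalized = "|".join(
        s.strip().upper() for s in (first_name, last_name, dob, city_of_birth)
    )
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return "GUID_" + digest[:12].upper()

# Two labs entering the same information, with different casing/whitespace,
# derive the same identifier...
a = make_guid("Ada", "Lovelace", "1815-12-10", "London")
b = make_guid(" ada ", "LOVELACE", "1815-12-10", "london")
print(a == b)   # True

# ...and the archive only ever sees the GUID, never the underlying fields.
```

A plain hash like this would let anyone who knows the fields recompute the GUID, which is why the real system needs extra safeguards to keep the GUID from becoming an identifier in itself, as the slide notes.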
  11. Federation – The GUID Does Work
  12. At this point, data have been received from the laboratory that measured them. Each subject has a GUID or a pseudo-GUID. A data dictionary has been defined, and the submitted data have been validated against that definition. How does an outside user find data they are interested in?
  13. An Example of Data Associated with a Particular Laboratory
  14. NDA is now assigning DOIs to each study and can track how often a DOI link is clicked (the start of a data citation).
  15. The search results in 750 subjects being discovered.
  16. Policy – Consents: • Assertion: any consent language that restricts the use of the data to particular purposes (“for autism research…”) results in profound negative consequences. • For example, if a researcher is trying to aggregate data from subjects with schizophrenia and autism to understand common symptoms observed in the two diagnostic groups, a consent that limited a dataset to understanding only one of those diagnostic conditions would probably mean the data are not accessible for a comparison study. • Restricted data are also probably off limits for those who are trying to use data mining techniques to develop or substantiate a hypothesis. • There are some cases where restrictive consents might be appropriate, but this should be the rare exception.
  17. Policy – Data Deposition: • NIMH expects that research we pay for involving human subjects will result in that data being made available in NDA. • Journals can also have a positive role to play in requiring that data be placed in a repository prior to publication. • Asking for data volunteers probably isn’t good enough right now.
  18. Policy – NDA Data Access and Data Submission: • Summary data are available to anyone via the web site, but accessing subject-level data requires a data access form. • Similarly, a data submission agreement is required that certifies that the data were consented for sharing. • Both forms require signatures from the PI and an institutional official. This means that the research institution is formally responsible for ensuring that the data are “treated with respect”. • Although neither form is complicated, they do raise barriers to accessing the data.
  19. Policy – Data from Institutions Outside the US: • The NIMH Data Archive does hold some data that were collected outside the US. • For those datasets, the institutional official has decided that depositing data is allowed both by the terms of the informed consent and by the laws of that country. • When there are restrictions on moving data, it is still possible to make it easy for the research community to find data by federating data archives.
  20. Policy – Timeline for Data Sharing: • For NDA, submitting data is separate from sharing those data with the research community. • Data are shared when the grant is complete or when a paper is published. • Other sharing timelines are possible. • No matter when the data are shared, data need to be submitted on a regular basis. This ensures that the data from a grant award have been submitted before funding is exhausted. More importantly, periodic data submission ensures that the data have undergone basic QC checks as they are collected.
  21. Rigor and Reproducibility – Data Archives Help: • Responding to a number of instances of high-visibility, high-impact experiments that were not thoughtfully designed, NIH (and NIMH) have instituted a number of programs to enhance rigor and reproducibility in research supported by NIH. • These discussions with the community started in June 2012. The new guidelines to increase rigor and reproducibility are outlined in NOT-OD-15-103 and on an accompanying web site. • Data archives play an important role in improving the rigor and reproducibility of NIMH-funded research.
  22. Rigor and Reproducibility – 2: • Data dictionaries are a key part of the NDA infrastructure. Each item in a data dictionary has an allowable range of values. The NDA has a validation tool that allows users to check a dataset to see if it conforms to the allowable ranges and formats in a data dictionary. • Because of our mandated data deposition schedule, the validation tool allows labs to find errors every 6 months when data are deposited (or more often if they choose).
  23. Rigor and Reproducibility – 3: • The NDA makes it easy to identify the data associated with a publication, and we assign a DOI to that dataset to make it trivial for the research community to find the data. • Identifying the data from a publication allows researchers to look at all of the data collected under an award and compare that to the data used in the publication.
  24. GUIDs and Rigor/Reproducibility: • There are “professional” research participants who seem to make a living volunteering for clinical studies. Websites exist that make it relatively easy for such participants to find the right answers to screening questions to be admitted to a study. Clearly, this can be dangerous to the volunteer and can also put the rigor and reproducibility of the study at risk. • Recruitment in certain diagnostic categories may take place in a small number of clinical centers. This means that papers from many different research groups may be sampling from a smaller population than the “independent” papers might suggest. • The NDA GUID helps the research community understand the size of these problems and deal with these issues. • There are also commercial services that aid in screening for someone who is participating in multiple clinical trials.
  25. NIH/NIMH Data Archives Staff
  26. Two Different Sorts of Data Archives: 1) NIMH Data Archive – very heterogeneous data collected in multiple laboratories. NDA attempts to aggregate these data using a global unique identifier system as well as data dictionaries to describe the myriad experiments. 2) Human Connectome Program – heterogeneous data (clinical assessments, imaging, MEG, genomics) collected using a common protocol. The first phase of this project involved data collection from typical research subjects at a single site. The project has recently expanded to include data collected across the lifespan for control subjects as well as from subjects with a diagnosis. Those datasets are collected at multiple laboratories but still use similar data collection protocols.
  27. Human Connectome Project: • The NIH Human Connectome Project (HCP) is supported by the NIH Neuroscience Blueprint ICs. • The HCP is an ambitious effort to map the neural pathways that underlie human brain function. The overarching purpose of the project is to acquire and share data about the structural and functional connectivity of the human brain. It has greatly advanced the capabilities for imaging and analyzing brain connections, resulting in improved sensitivity, resolution, and utility, thereby accelerating progress in the emerging field of human connectomics. • Phase 1 of the HCP resulted in two awards: ■ David Van Essen and Kamil Ugurbil, Wash U and U Minnesota ■ Bruce Rosen, MGH
  28. Phase 1 Connectome Accomplishments: 1) Deliver advanced MRI scanners and techniques with high spatial and temporal resolution for functional and diffusion MRI. ■ Both the MGH and the Wash U MRIs worked as designed and were able to collect data quickly. Siemens learned a great deal from collaborating on both instruments, and their new family of 3T MRIs (the Prisma) has operating characteristics similar to the Wash U scanner. ■ The supplements to port the pulse sequences to other laboratories and to other manufacturers have also been successful.
  29. Phase 1 Connectome Accomplishments (continued): 2) Deliver high quality data to the research community. ■ Wash U has released data from 1200 subjects, including behavioral assessments, structural MRIs, rs-fMRI, task fMRI, and diffusion experiments. MEG data have also just been released. This is the first time that a large imaging award adopted “genome speed” data release. ■ Data from MGH are being made available on their web site as well as on the Wash U web site. ■ The data are being widely used by the research community: more than 100 papers cited the Wash U grant at the point where data collection was only half complete. ■ High-visibility papers have appeared; researchers from outside the WU-Minn collaboration have authored some of those papers.
  30. Connectome Today: • Based on the results from the original connectome project (which collected data on 22-34 year old healthy subjects), NIH decided to fund awards for a lifespan connectome. Three awards have been made that will cover the age range from birth to the oldest old (90+). • In addition, NIH has funded 14 awards to measure connectomics in groups that have some sort of diagnosis (Alzheimer’s, low vision, dementia, epilepsy, mood and anxiety disorders, psychosis, …). • Over 8,000 subjects are participating, and nearly 12,000 scans are expected in the data infrastructure by 2021. Phenotypic and clinical assessments as well as other non-MRI data are being collected and made available. • In addition, the Adolescent Brain Cognitive Development (ABCD) study has chosen to use the connectome data collection protocol. That study intends to enroll 10,000 children aged 9-10 and follow them into early adulthood. This dataset requires a data access agreement.
  31. HCP Data: • A Connectome Coordination Facility has been created to hold all of the data. • The original HCP consents allowed almost unlimited access to the data (clinical and phenotypic data as well as the MRI and MEG data). • An individual who wants data simply enters a working web site into the registration system and certifies that they will not attempt to re-identify any of the research participants. • Many of the original participants are part of the Missouri twin study. This caused some of the measured data (family structure, substance use) to be declared sensitive. The sensitive data had a more restrictive data access protocol.
  32. ConnectomeDB Moving to NIMH Data Archive (NDA)
  33. ConnectomeDB
  34. ConnectomeDB – Widespread Data Usage
  35. HCP – Open Access: • Open access clearly resulted in a lot of data use – transfers, papers, … • Open access probably helped the community adopt HCP data collection as the current standard. • Even in this open access dataset, there is still some information that is sensitive and requires approval. When the data were at Wash U, the only penalty for misusing the data was loss of further access to the data. • No penalties were ever imposed for mistreating data. • ABCD early data availability (which requires a data access agreement) seems to be around the same level as HCP – does this mean that researchers will do what it takes to get good data? • Probably the key question to think about when deciding between open access and a more restrictive model is what penalties need to be imposed if the data are not treated in accord with the data access agreement.
  36. The “WU-Minn” HCP Consortium of the Initial HCP Dataset
  37. Summary: • Understanding complex diseases requires lots of different data from a variety of sources. • Informed consents, national laws concerning data sharing, and investigator preferences can all restrict the aggregation of data. • All of those issues can be solved, with some effort. • If you have an option, the decision whether to share data under a very open access model or in a federal database should be based on what needs to happen if the data are not treated appropriately. • Even though it is easier to get data from an open repository, early results from the ABCD project suggest that users will take the steps needed to get access to high-quality data.