Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23, 2015
Part of “Beyond metadata: Supporting non-standardized documentation to facilitate data reuse”
Sarah Pickle, CLIR/DLF Social Science Data Curation Fellow, Penn State University Libraries
19–22. What do we need to keep in mind?
Ethical treatment of human subjects and approvals for data use agreements
Secure technology for data transport, storage, access, and use
Efficient use of data in secure environment
Greater good of sharing data
24–27. Preliminary recommendations
Ethical treatment of human subjects and approvals for data use agreements
disclosure risks • risky links • informed consent language • cultural context
Secure technology for data transport, storage, access, and use
IT specs • data use agreement
Efficient use of data in secure environment
(specific!) deidentification activities
Greater good of sharing data
pre-written appeal to IRB
Editor's Notes
This is how we used to think about data sharing.
This image is from the cover of a book published in 1985.
But, if we’re honest with ourselves, it’s not that far off from how we still share sensitive data today. In fact, if we just replace the floppy with a DVD…
Yep—that’s pretty much what it looks like. When sensitive data are actually shared, DVDs with those data are passed around from agency to researcher or between researchers and stored in locked boxes in secure data rooms. Access and use are restricted.
Because of the privacy concerns around sensitive data, there are miles of red tape between those hands.
Social scientists rely heavily on secondary data for their research. Those who collect data themselves want to—and may well have to—share their work, but they don’t know how to, because much of their data includes private information about their research subjects.
But I’m starting to think that some additional documentation can help with giving and getting access to these restricted data.
So this presentation will share a preliminary framework for thinking about what additional documentation might be needed in order to help enable access to and reuse of restricted data. I’ll also offer a handful of suggestions for fulfilling that need.
This is preliminary and open to feedback, new ideas, etc.!
---------------------
Let’s start from the beginning.
By “restricted data,” I mean data that could simply be sensitive or could possibly even cause harm to people or property.
Despite the sensitive nature of some social science data, we have a greater chance of being able to share restricted data from the social sciences than we do data from other fields that are bound by, say, HIPAA regulations (e.g., biomedicine) or export control (e.g., engineering).
Sensitive data in the social sciences that are restricted are typically identifiable information. And since social scientists are often funded by NSF, NIH, and the like, they’re also bound by requirements to share their data. But that’s hard to do, given the need to protect the privacy of study participants.
Where are we now with sharing restricted social science data? There’s currently a great deal of work being done at a select number of academic campuses to facilitate access to restricted-use social science data that are provided by federal agencies and by organizations like ICPSR and NORC.
Developing physical and virtual data enclaves in which restricted data can be securely and safely stored and used (besides the population research centers out there, this is happening at places like Cornell, Emory, Johns Hopkins, UVa, Wisconsin, Rutgers)
Improving processes for getting restricted data use agreements—that is, contracts for secondary data use—signed. (At Penn State, three different units can currently sign data use agreements, and it’s unclear which one PIs should go to under which circumstances; nor are the offices coordinated to ensure that the security requirements negotiated in one are consistent with those in the others.)
------------------------
But really, the focus of this presentation is on us in this room.
How can WE—data managers or curators at universities or in research organizations—serve as the providers of sensitive data rather than just facilitators? Sure, facilitation is hard, but I want to ask how we can play a role akin to that of the Department of Education, which publishes the data from the National Center for Education Statistics? How can we provide access to the data collected by our researchers, rather than only facilitating access to data that belong to other entities?
In a few places, we already see the university stepping up as a provider of restricted data: e.g., Health and Retirement Study & Michigan Center on the Demography of Aging (Michigan), Add Health (UNC-CH)
But I want to address what folks like us are more likely to encounter: the myriad data sets that are much smaller than those from Pop Centers that are being created by our social scientists all the time.
Two Penn State examples: Jenny Trinitapoli’s data on religion and health in Africa; the Tremin Research Program on Women’s Health, a longitudinal study of women’s reproductive health
These are data that need to be shared—whose greatest contributions to research are in those sensitive data that would be obscured in an anonymized public version of these files—but their PIs don’t know how to share them because those researchers don’t have an entire center or program dedicated to sharing their data and they don’t have the money to take advantage of services like those provided by ICPSR
I’ll briefly mention that advice is emerging on how to plan early for sharing human subjects data—Elizabeth Buchanan at UW-Stout is a leading figure in this conversation; speaks to how researchers might prepare their informed consent language for easier reuse.
But while we can be proactive and try to get in on the ground floor of a project, at this stage, we’re still more likely to have a PI approach us at the end of a project, asking for help sharing her data.
So what can we do with the data that fall in our laps? How can we share these data responsibly? Answer so far:
Let’s create public-use files of these data: we have pretty good guidance for how to do that (the report from ANDS, as the most recent example, but also from the UK Data Archive).
While an anonymized, public dataset may be sufficient to help address some research questions, anonymization can obscure the information that is likely to be most useful to other researchers. (Ex: being able to drill down to smaller geographic areas in order to speak to distinctions between neighborhoods. Allows for more nuanced investigations.) These research questions may well only be addressed through access to the restricted files.
But it can be such a pain to actually get ahold of and use these data for all the reasons I’ve already mentioned—finding secure spaces to work; getting contracts signed; but also convincing IRB it’s necessary to work with these data. (Reference proposed Jisc study http://bit.ly/1CUZQP6 )
Now I’d like to try to provide a framework for how we might help document restricted social science data in order to facilitate access and reuse. So that we can be the data providers, not just facilitators.
Caveat again: just some preliminary ideas here and many may seem obvious. I’d love to hear your reactions.
If we, data managers/curators/repository staff, want to help provide this kind of first-level access to the many sensitive datasets on our campuses, we can’t just lock dozens of DVDs or external hard drives in a closet only we have the key to and then see what comes at us.
It’d be great if this was all it took to share these data.
But really, there are a ton of detours between these PIs.
So, we have to think strategically about who and what is responsible for all those detours and ask how we can prepare for them.
Whom do we need to talk to in order to navigate this crazy route?
Original PI, who knows what they gathered and how they gathered it
Secondary PI, who knows how they’ll use the data and the risk of disclosure when they use the data, given their knowledge of the field
On-campus policy officers and contract negotiators: IRB (human subjects), Office of Sponsored Programs. Anyone who determines under what conditions data may and can be shared and used.
On-campus IT (security folks), who need to talk with those policy makers and enforcers in order to ensure compliance and consistency in securing the data.
Finally, we also need to involve advocates at a higher level, e.g., NSF program officers, who say that sharing these data is important. If we don’t have them behind us, it’ll be hard to motivate our local policy makers and enforcers to take this risk on.
Once we understand the concerns of all these parties and navigate this tricky route between PIs…
…MAYBE we can actually share this stuff.
-------------------------------------
When we first get all these people involved in this kind of conversation, that conversation is often limited to what we can’t do. IRB, Risk Management, OSP: their goal is to mitigate risk and protect subjects; it’s understandable.
But what happens, as a result? What generally happens to a restricted dataset at the end of a project? Best case scenarios: it stays on lockdown, accessible only to the research team; an anonymized file is produced and shared—a public-use dataset.
With the exception of those big, federally-supported datasets mentioned earlier or those that end up in ICPSR or another national data archive (b/c they have a lot of money to pay ICPSR or another archive to curate them), there is rarely ever any useful information available about the datasets on lockdown or anything useful about the restricted versions of public datasets.
Rare even to have metadata records so that potential users could know these resources exist. Honestly, though, that would be one good place to start: creating tombstone metadata records in our institutional repositories that don’t link to the restricted dataset, but at least provide contact information for the gatekeeper. This record could also contain a field specifying the embargo periods for the data and trigger dates for public releases.
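One way to picture such a tombstone record is as a minimal machine-readable stub. Everything below is a sketch: the field names, the contact address, and the dates are invented placeholders, not drawn from any metadata standard.

```python
import json

# A minimal "tombstone" record for a restricted dataset: it describes the
# data and names a gatekeeper but deliberately links to no files. All field
# names and values here are illustrative placeholders, not a standard.
tombstone = {
    "title": "Example longitudinal health survey (restricted-use files)",
    "creator": "Example PI, Example University",
    "description": ("Survey data restricted because they contain "
                    "identifiable human-subjects information."),
    "access": {
        "status": "restricted",
        "gatekeeper_contact": "dataservices@example.edu",
        "instructions": "Contact the gatekeeper to begin a data use agreement.",
    },
    "embargo": {
        "restricted_until": "2025-01-01",        # embargo period for the data
        "public_release_trigger": "2025-01-01",  # date a public file appears
    },
}

# The record can be serialized for deposit in an institutional repository.
print(json.dumps(tombstone, indent=2))
```

The point is that the record is discoverable and harvestable even though the dataset itself stays behind the gatekeeper.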
But that’s just a start. Once researchers know about these restricted datasets, how then can we provide access to them in a secure and responsible way?
If we keep pushing and asking how we can share instead of why we can’t, we’ll need to turn to the policies that govern all this. So, when a researcher wants to use a restricted dataset, she might run into a handful of challenges.
Here are some things we’ll need to keep in mind.
First, we need to keep in mind the ethical treatment of human subjects by the secondary data user and approvals for their data use agreements.
One challenge is that the IRB has trouble making a determination about the risks to human subjects in a dataset that the IRB can’t see. And there’s a parallel challenge in trying to evaluate the risks that might arise when dataset A is linked to dataset B if the Board can’t examine one or both of those datasets.
IRB has trouble knowing what exactly participants have consented to w/r/t how the data might be used in a study other than the one they were direct participants in.
Still on the topic of general research ethics: when trying to protect sensitive data, we often do so at the risk of eliding important contextual information, which in turn could lead to misuse or misappropriation of the data. This is an especially grave risk when personal human subject information is involved.
These ethical matters also motivate the regulations included in data use agreements. What’s more, the issue of DUAs bleeds into technology requirements for ensuring the safe use and storage of these data.
DUAs force—or should force—us to think about the technology used to transport, store, get access to, and work on the restricted data. They’re really about how to do all that safely and securely.
But the groups on our campuses that sign DUAs—OSP/Risk Management/Purchasing—often have trouble interpreting what security protections need to be in place for storing and using the data, who can implement those protections, and who is ultimately responsible, should there be a breach.
Another consideration is that researchers need to be able to make efficient use of the data in that secure environment. There will certainly be limitations on who can access the data once acquired, and there may also be limits on how much time those who can work with the data are allowed to do so.
So, for example: Before the researcher herself gets access, we need to think about how her programmers/graduate students can prepare code for a restricted-use file using only a public-use copy of the dataset
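As a sketch of that workflow: if the public and restricted files share a layout, analysis code can be written and debugged against the public copy, then run unchanged inside the enclave by pointing it at the restricted path. The file names and column name below are hypothetical.

```python
import csv

def mean_of(path, column):
    """Mean of a numeric column in a CSV file; rows with blanks are skipped."""
    values = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row[column].strip():
                values.append(float(row[column]))
    return sum(values) / len(values)

# Graduate students develop and test against the public-use copy...
#   mean_of("public_use.csv", "age")
# ...and the PI later runs the identical call on the restricted file
# inside the enclave:
#   mean_of("restricted_use.csv", "age")
```

Keeping the analysis parameterized by file path is what makes the limited enclave time count.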
Finally, when we run up against these challenges and feel helpless, we should keep in mind how providing access to the data could potentially benefit the greater good and let that motivate us to keep pressing on.
(To be honest, this is an overarching issue to keep in mind as we think about our need to share restricted data; the reminder comes from an NSF program officer.) Assuming that practically any sensitive data these days—even “anonymized” data—are reidentifiable, which may well be the case, how damaging would it actually be to the subjects for these data to be shared in a controlled way? And is that risk outweighed by the greater beneficence or justice these data can help deliver? The Belmont Report—which addresses ethical principles and guidelines for the protection of human subjects—does say that the latter two tenets, beneficence and justice, must weigh just as heavily as respect for persons when we balance the basic ethical principles of research.
With these considerations in mind, what can we do? What additional documentation can we—as data providers at relatively small scale—make available in order to help address those challenges?
Here are some untested recommendations.
With respect to the first point, we can provide detailed documentation about: which variables individually or in combination pose disclosure risks; why they do; and how those risks might change given the secondary user’s areas of expertise or other considerations.
Disclose potential risky linking with other datasets
Include copies of informed consent language
Provide even more cultural context, for example, for the data in order to try to prevent misuse and misappropriation of them
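For instance, per-variable risk documentation might take a shape like the following; the variables, field names, and treatments are all invented for illustration and follow no existing standard.

```python
# Hypothetical per-variable disclosure-risk notes for a restricted dataset,
# sketching what "documenting disclosure risks" could look like in a
# machine-readable form a curator or IRB could scan quickly.
risk_notes = {
    "zip_code": {
        "risk": "high",
        "reason": "Combined with birth_date, can narrow subjects to a few people.",
        "risky_combinations": [("zip_code", "birth_date")],
        "public_file_treatment": "truncated to first 3 digits",
    },
    "interview_transcript": {
        "risk": "high",
        "reason": "Free text may contain names, places, and cultural details.",
        "risky_combinations": [],
        "public_file_treatment": "excluded from the public file",
    },
}

# A quick summary a curator might hand to the IRB or a secondary PI:
for var, note in risk_notes.items():
    print(f"{var}: {note['risk']} risk; public file: {note['public_file_treatment']}")
```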
On the second point, we can draft flexible data use agreements that could be provided as an appendix to the dataset; they’d detail IT specifications for securely transporting, storing, providing access to, and using the data, but could be adjusted depending on the secondary PI’s institutional context.
Enabling efficient use of the data ties in with ideas already mentioned in the context of human subjects review: information on precisely how variable names have been changed in, say, creating the public file of a restricted dataset, and how the coding in the restricted file differs from the public one.
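A hypothetical variable crosswalk can capture exactly those name and coding differences; everything below (variable names, coding notes) is invented for illustration.

```python
# Hypothetical crosswalk between a public-use file and its restricted
# counterpart, recording renames and coding differences so code written
# against one can be translated to the other.
crosswalk = {
    # public_name: (restricted_name, coding difference)
    "region": ("county_fips", "public file collapses counties into 4 regions"),
    "age_group": ("age", "public file bins exact age into 5-year groups"),
}

def restricted_name(public_name):
    """Look up the restricted-file variable behind a public-file variable."""
    return crosswalk[public_name][0]
```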
Finally, we can help promote use of the data as a contribution to the greater good: we can provide language helping to articulate how the IRB, for example, might want to weigh the three tenets of ethical research when making its decision about whether a PI should be approved to use the restricted dataset.
-----------------
This is where I am now as I try to tease out the different challenges we face in sharing those restricted data dropped in our laps: it largely boils down to concerns related to human subjects protections and technology. It would certainly be easier to address those concerns before the data are gathered—before the original study goes to the IRB—but since that’s still rarely the case, I hope I’ve sketched out at least a framework for how we can approach the question of sharing restricted data collected by our local researchers.