
Managing and publishing sensitive data in the social sciences - Webinar transcript


Transcript of the 29th March ANDS webinar.
Slides and recording are available from the ANDS website.


[Unclear] words are denoted in square brackets

Managing and publishing sensitive data in the social sciences – ANDS Webinar 29 March 2017
Video & slides available from ANDS website

START OF TRANSCRIPT

Kate LeMay: Good afternoon - or good morning if you're over in the Perth time zone - to everyone. Thank you for calling into our webinar today. We've got some handouts in today's webinar as well. We've got a guide to publishing and sharing sensitive data, which is an ANDS resource, and also the ANDS sensitive data decision tree. That's a one-page summary of the information that's available in our guide. I'd just like to introduce our two guests today. We've got Professor George Alter. He's a research professor in the Institute for Social Research and Professor of History at the University of Michigan. His research integrates theory and methods from demography, economics and family history with historical sources to understand demographic behaviours in the past. From 2007 to 2016 he was the Director of the Inter-university Consortium for Political and Social Research, ICPSR, the world's largest archive of social science data. He's been active in international efforts to promote research transparency, data sharing and secure access to confidential research data. He's currently engaged in projects to automate the capture of metadata from statistical analysis software and to compare fertility transitions in contemporary and historical populations. We're lucky to currently have him as a visiting professor at ANU. Dr Steve McEachern is the Director of the Australian Data Archive at the Australian National University. He holds a PhD in industrial relations
and a graduate diploma in management information systems, and has research interests in data management and archiving, community and social attitude surveys, new data collection methods and reproducible research methods. Steve has been involved in various professional associations in survey research and data archiving over the last 10 years and is currently chair of the executive board of the Data Documentation Initiative. Firstly, we're going to hand over to George, who's going to share the benefit of over 50 years of ICPSR managing sensitive social science data. Over to you, George.

George Alter: Thank you, Kate. It's a pleasure to talk to you today. ICPSR, as Kate mentioned, has been data archiving for more than 50 years, and an increasing amount of our effort has gone into devising safe ways to share data that have sensitive and confidential information. At the heart of everything we do in terms of protecting confidential information is a part of the research process where, when we ask people to provide information about themselves, we make a promise to them. We tell them that the benefits of the research that we're going to do are going to outweigh the risk to them, and we say that we will protect the information that they give us. We have a lot of data that we receive at ICPSR and here at the ADA that include questions that are very sensitive. Often we're asking people about types of behaviour that could cause them harm - we might be specifically asking them about criminal activity. We might be asking them about medications that they take that could affect their jobs or other things, so we have to be careful about it. We're afraid that if the information gets out it could be used by various actors for specific purposes - it could be used in a divorce proceeding, for example. Sometimes we interview adolescents about drug use or sexual behaviour, and we promise them that their parents won't see it, and so on.
In the data archiving world we often talk about two kinds of identifiers. There are direct identifiers - which are things like names, addresses, social security numbers - many of which are unnecessary, but some types of direct identifiers - such as geographic locations or genetic characteristics - may actually be part of the research project. Then the most difficult problem, often, is the indirect identifiers. That is to say, characteristics of an individual that, when taken together, can identify them. We refer to this often as deductive disclosure, meaning that it's not obvious directly, but if you know enough information about a person in a data set, then you can match them to something else. Frequently we're concerned that someone who knows that another person is in the survey could use that information and find them, or that there is some other external database where you could match information from the survey and re-identify a subject. Deductive disclosures often are dependent on contextual data. If you know that a person is in a small geographic area, or you know that they're in a certain kind of institution, like a hospital or a school, it makes it easier to narrow down the field over which you have to search to identify them. Unfortunately, in the social sciences, contextual data has become more and more important. People now are very interested in new things like the effect of neighbourhood on behaviour and political attitudes, or the effect of available health services on morbidity and mortality. There are a number of different kinds of contextual data that can affect deductive disclosure. We're in a world right now where our social science researchers are increasingly using data collections that include items of information that make the subjects more identifiable.
For example, people studying the effectiveness of teaching often have data sets that have the characteristics of students, teachers, schools, school districts. Once you put all those things together it becomes very identifiable.
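As a rough illustration (my own sketch, not a tool discussed in the webinar), the deductive-disclosure problem George describes can be seen by counting how many records share each combination of indirect identifiers; a combination that matches only one record makes that person identifiable. The records and field values below are invented for the example.

```python
# Count how many records share each combination of indirect identifiers
# (quasi-identifiers). A combination with a count of 1 is disclosive:
# anyone who knows those attributes can find that person in the data.
from collections import Counter

records = [
    ("teacher", "Springfield High", "District 9"),
    ("teacher", "Springfield High", "District 9"),
    ("principal", "Springfield High", "District 9"),  # unique combination
]

combo_counts = Counter(records)
unique = [combo for combo, n in combo_counts.items() if n == 1]
print(unique)  # [('principal', 'Springfield High', 'District 9')]
```

The teacher records protect each other because two people share the same profile; the principal stands alone, which is exactly the "students, teachers, schools, school districts" risk described above.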
We at ICPSR - and, I think, the social science data community in general - have taken up a framework for protecting confidential data that was originally developed by Felix Ritchie in the UK that talks about ways to make data safe. I'm going to go through these points, but Ritchie talks about safe data, safe projects, safe settings, safe people and safe outputs. The idea of this is not that any one approach solves the problem, but that you can create an overall system that draws from all of these different approaches and uses them to reinforce each other. Safe data means taking measures that make the data less identifiable. Ideally, that starts when the data are collected. There are things that data producers can do to make their data less identifiable. One of the simplest things is to do something that masks the geography. If you're doing interviews it's best to do the interviews in multiple locations, which adds to the anonymisation of your interviewees. Or, if you're doing them in only one location, you should keep that information about the location as secret as possible. Once the data have been collected - research projects have been using a lot of different techniques for many years to mask the identity of individuals. The most common one is what's called top coding, where if you ask your subjects about their incomes, the people with the highest incomes are going to stand out in most cases, so usually you group them into something that says people above $100,000 in income [or something like that], so that there's not just one person at the very top, but a group of people, which makes them more anonymous. This list of things that I've given here - which goes from aggregation approaches to actually affecting the values - is listed in terms of the amount of intervention that's involved.
Some of the more recently developed techniques actually involve adding noise or random numbers to the data itself, which tends to make it less identifiable, but it also has an impact on what you can do with the data.
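A minimal sketch of the two masking techniques just described - top coding and noise addition - might look like this. The income values, the $100,000 threshold and the noise scale are illustrative only, not anyone's actual procedure.

```python
import random

# Hypothetical survey incomes; the top earner would stand out if published.
incomes = [42_000, 58_500, 61_000, 250_000, 1_200_000]

# Top coding: collapse everything above a threshold into one category,
# so the highest incomes no longer appear as unique values.
TOP_CODE = 100_000
top_coded = [min(x, TOP_CODE) for x in incomes]
print(top_coded)  # [42000, 58500, 61000, 100000, 100000]

# Noise addition: perturb each value by a small random amount. This makes
# records harder to match, at some cost to analytic precision - the
# trade-off mentioned above.
random.seed(0)  # fixed seed so the sketch is reproducible
noised = [x + random.gauss(0, 1_000) for x in top_coded]
```

Top coding preserves the data exactly for most respondents; noise addition changes every value slightly, which is why it sits further along the "amount of intervention" scale.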
Safe projects means that the projects themselves are reviewed before access is approved. At most data repositories, when the data need to be restricted because of sensitivity, we ask the people who apply for the data to give us a research plan. That research plan can be reviewed in several different ways. The first two things are things that we do regularly. At ICPSR we ask, first of all, do you really need the confidential information to do this research project and, if you do need it, would this research plan identify individual subjects? We're not in the business of helping marketers identify people for target marketing, so we would not accept a research plan that did that. There are also projects that actually look at the scientific merit of a research plan. To do that, though, you need to have experts in the field who can help you. Safe settings means putting the data in places that reduce the risk that it will get out. I'm going to talk here about four approaches. The first one is data protection plans. For data that need to be protected, but where the level of risk is reasonably low, we often send those data to a researcher under a data protection plan and data use agreement - which I'll come to in a couple of minutes. The data protection plan specifies how they're going to protect the data. Here's a list of things that we worry about that one of my colleagues at ICPSR made up. One of the things we ask people is: what happens if your computer is stolen? How will the confidential data be protected? There are a number of things that people can do, like encrypting their hard disk or locking their computers in a closet when they're not being used, that can address these things.
I think that data protection plans need to move to just a general consideration of what it is that we're trying to protect against and allow the users to propose alternative approaches rather than saying oh, you have to use this particular
software or this or that. We have to be clear about what we're worried about. A couple of notes about data security plans: data security plans are often difficult, partly because of the approach that has been taken in the past, and also because researchers are not computer technicians, and we're often giving them confusing information. One of the ways that, in the future - in the US at least - universities are going to move beyond this is something I'm already seeing: universities developing their own protocols where they use different levels of security for different types of problems. At each level they specify the kinds of measures that researchers need to take to protect data that is at that level of sensitivity. From my point of view, as a repository director, I think that any time the institutions provide guidance it's a big help to us. The other way to make the data safe - by putting it in a safe setting - is actually to control access. There are three main ways that repositories control access. One kind of system is what [I'd call a] remote submission and execution system, where the researcher doesn't actually get access to the data directly. They submit program code or a script for a statistical package to the data repository. The repository runs the script on the data and then sends back the results. That's a very restrictive approach, but it's very effective. More recently, however, a number of repositories and statistical agencies have been moving to virtual data enclaves. These enclaves - which I'll illustrate briefly in a minute - use technologies that isolate the data and provide access remotely but restrict what the user can do. The most restrictive approach is actually a physical enclave. At ICPSR we have a room in our basement that has computers that are isolated from the internet. We have certain data sets that are highly sensitive.
If you want to do research with them, you can, but on the way into the enclave we're
going to go through your pockets to make sure you're not trying to bring anything in, and [on] the way out we're going to go through your pockets again, and you'll be locked in there while you're working, because we want to make sure that nothing uncontrolled is removed from the enclave. The disadvantage of a physical enclave is that you actually have to travel to Michigan to use those data, which could be expensive. That's the reason that a number of repositories are turning to virtual data enclaves. This is a sketch of what the technology looks like. What happens is that you, as a researcher, go over the internet and log on to a site that connects you to a virtual computer. That virtual computer has access to the data, but your desktop machine does not. You can only access the data through the virtual machine. At ICPSR we actually use this system internally for our data processing to provide an additional level of security. We talk about the virtual data enclave, which is the service we provide to researchers, and the secure data environment, which is where our staff work when they're working on sensitive data. It's a little bit of a let-down, but this is what it actually looks like. The window that's open there with the blue background is our virtual data enclave. I've opened a window for [unclear] inside there. The black background is my desktop computer. If you look closely, you'll see in the corner of the blue box the usual Windows icons, and that's because when you're operating remotely in the virtual enclave you're using Windows. It looks just like Windows and acts just like Windows, except that you can't get to anything on the internet. You can only get to things that we provide, for a level of security.
On top of that the software that’s used - we use [VMware] software, but there are other brands that do the same thing - essentially turns off your access to your printer, turns off your access to your hard drive
or the USB drive, so you cannot copy data from the virtual machine to your local machine. You can take a picture of what you see there - and because you have that capability, we also restrict people with a data use agreement. That's my next topic: how do you make people safer? The main way that we make people safer is by having them sign data use agreements or by providing them training. The data use agreements used at ICPSR are, frankly, rather complicated. They consist of the research plan, as I mentioned before; we require people to get IRB approval for what they're doing; a data protection plan, which I mentioned; and then there are these additional things of behavioural rules and security [pledges] and an institutional signature, which I'll mention now. If you look at the overall process of doing research, there are a number of legal agreements that get passed back and forth. It actually starts with an agreement made between the data collectors and the subject, in which they provide the subjects with informed consent about what the research is about and what they're going to be asked. It's only after that that the data go from the subject to the data producers. Then the data archive - such as ICPSR or ADA - actually reaches an agreement with the data producers in which we become their delegates for distributing the data. That's another legal agreement. Then, when the data are sensitive, we have to get an agreement from the researcher - and these are pieces of information we get from the researcher - and, in the United States, our system is that the agreement is actually not with the researcher, but with the researcher's institution. At ICPSR, we're located at the University of Michigan, and all of our data use agreements are between the University of Michigan and some other university, in most cases. There are some exceptions. It's
only after we get all of these legal agreements in place that the researcher gets the data. One of the things in our agreements at ICPSR is a list of the types of things that we don't want people to do with the data. For example, we don't want someone to publish a cross-tabulation table where there's one cell that has one person in it, because that makes that person more identifiable. There's a list of these things - often we have 10 or 12 of them - that are really standard rules of thumb that statisticians have developed for controlling re-identification. The ICPSR agreements are also, as I said, agreements between institutions. One of the things that we require is that the institution takes responsibility for enforcing them, and that if we at ICPSR believe that something has gone wrong, the institution agrees that they will investigate this based on their own policies about scientific integrity and protecting research subjects. DUAs are not ideal - there's a lot of friction in the system. Currently, in most cases, a [PI] needs a different data use agreement for every data set, and they don't like that. We can, I think, in the future, reduce the cost of data use agreements by making institution-wide agreements where the institution designates a steward who will work with researchers at that institution. There's already an example of this: the [Databrary] project - which is a project in developmental psychology that shares videos - has done very good work on legal agreements. My colleague, the current director at ICPSR, Margaret Levenstein, has been working on a model where a researcher who gets a data use agreement for one data set can use that to get a data use agreement for another data set, so that individuals can be certified and include that certification in other places. One of the things that I think we need to do more about is training.
A number of places, like ADA, train people who get confidential data. We've actually done some work on developing an online tutorial about
disclosure risk, which we haven't yet released, but it's, I think, something that should be done. Finally, there's safe outputs. The last stage in the process is that the repository can review what was done with the data and remove things that are a risk to subjects. This only works if you retain control, so it doesn't work if you send the data to the researcher, but it does work if you're using one of these remote systems like remote submission or a virtual data enclave. Often, this kind of checking is costly. There are some ways to automate part of it, but a manual review is almost always necessary in the end. A last thing about the costs and benefits: obviously, data protection has costs. Modifying data affects the analysis. If you restrict access you're imposing burdens on researchers. Our view is that you need to weigh the costs against the risks that are involved. There are two dimensions of risk. One dimension is: in this particular data set, what's the likelihood that an individual could be re-identified if someone tried to do it? And, secondly, if that person was re-identified, what harm would result? We think about this as a matrix, where you can see in this figure, as you move up you're getting more harm. As you move to the right you're increasing the probability of disclosure. If the data set is low on both of these things - for example, if it's a national survey where 1000 people from all over the United States were interviewed and we don't know where they're from and we ask them what their favourite brand of refrigerator is - that kind of data we're happy to send out directly over the web, without a data use agreement, with simple terms of use. But as we get more complex data with more questions, more sensitive questions, we often will add some requirements in the form of a data use agreement to assure the data are protected.
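One of the automatable parts of output checking is the rule of thumb mentioned earlier: no published cross-tabulation cell should contain a single respondent. A hypothetical sketch (my own, not ICPSR's actual tooling) of suppressing such cells before release:

```python
# Suppress any cross-tabulation cell that would expose a single respondent.
# The table below is an invented example: cell keys are (region, offence)
# and values are respondent counts.
table = {
    ("North", "offence A"): 14,
    ("North", "offence B"): 9,
    ("South", "offence B"): 1,  # one person - disclosive if published
}

published = {cell: (n if n > 1 else "suppressed") for cell, n in table.items()}
print(published[("South", "offence B")])  # suppressed
```

A real disclosure-control system would also worry about secondary suppression (recovering a hidden cell from row and column totals), which is part of why a manual review is almost always still needed.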
When we get to complex data where there is a strong possibility of re-identification and
where some harm would result to the subjects, in that case we often add a technology component like the virtual data enclave. Then there are the really seriously risky and sensitive things. My usual example of this is a data set we have at ICPSR that was compiled from interviews with convicts about sexual abuse and other kinds of abuse in prisons. That data is very easy to identify and very sensitive. We only provide that in our physical data enclave. That's the end of my presentation. Thank you for your attention. We'll take questions later.

Kate LeMay: Great. Thank you, George. We'll pass over to Dr Steve McEachern to give his presentation about managing sensitive data at the Australian Data Archive.

Steve McEachern: My aim today is to build off what George has talked about, particularly taking the five safes model and looking at what the situation is in the Australian case. I'll talk about the Australian Data Archive and how we support sensitive data, but I want to put it in the context of the broader framework of how we access sensitive data in Australian social sciences generally. I'm going to talk about some of the different options that are around, picking up on some of what George has discussed in terms of some of the alternatives that are available, and demonstrate the different ways these are in use here in Australia. I'm really focussing more on the five safes model and its application in Australia than I am specifically on ADA. As I say, we are one component of the broader framework for sensitive data access here. What I really wanted to cover off here is thinking about sensitive data and the five safes model. I'll look at the different frameworks for sensitive data access in Australia and where you might find them, and then how we apply the five safes model at ADA in particular.
Then, time permitting, I might say something briefly about the data life cycle and sensitive data as we go through.
I wanted to just pick up on the ANDS definition here of sensitive data. Most of what we deal with at ADA, at some point in its life cycle, has been sensitive data. More often than not it's information that's collected from humans, often with some degree of identifiability, at least at the point of data collection, if not necessarily the point of distribution. A lot of what we deal with - and this is true for a lot of social science archives - would fall into the class of sensitive data. There's a distinction we would draw between what we get and what we distribute. In terms of our definition here - this is in the handout section, and it's available online - sensitive data is data that can be used to identify an individual, species, object, process or location, introducing the risk of discrimination, harm or unwanted attention. We tend to think in terms of human risks more than anything else, the risk to humans and individuals, but it does apply in other cases as well. For example, the identification of sites for Indigenous art might, in and of itself, lead more people to want to go and visit that location and, in a sense, destroy the thing that you're actually trying to protect; the more visits they actually get, the more degraded the art itself becomes. It doesn't just hold for human research, but that's probably our emphasis at ADA. Just to reiterate the five safes again, we talk about five things: people, projects, settings, data and outputs, and the reference is down the bottom - you can look at the document that Felix Ritchie and two of his colleagues developed, framing out the five safes model. What I would say about this is that it's been adopted directly by the UK Data Service. That's where it has its origins.
The basic principles are applied in a lot of the social science data archives, and it's now actually been adopted by the Australian Bureau of Statistics as well. Their framework - they're thinking about output of
different types of publications. Literally, [unclear] this model. [Unclear] it's quite a useful framework for talking about this. I'm going to take a slightly different approach to George in thinking about what we're worried about. As a depositor, you worry about the risk of disclosure. As a researcher, what's the flip side of that? Why do we need access to sensitive data? What does it provide? The National Science Foundation, about four or five years ago, put out a call around how we could improve access to microdata, particularly from government sources. It highlights why we talk about the need for access - the sorts of research you can do. This comes from a submission from [David Card], Raj Chetty and several other economists in the US and elsewhere. They were highlighting what's needed. Direct access is really the critical thing here, direct access to microdata. By microdata we mean information about individuals, line by line. Aggregate statistics, synthetic data - where we create fake people, as it were - [or] submission of computer programs for someone else to run really don't allow you to do the sorts of work you need to answer policy questions in particular. A lot of social policy research is focussed in this way. In order to do certain things, access to this data is necessary. How do we facilitate that, taking account of the sorts of concerns that have been raised? [On site] that is. How do people expect to access it? This was an interesting blog post from a researcher previously based at the University of Canterbury, comparing how you access US census data versus the New Zealand census - and, similarly, you could say the same for the Australian census as well. In the US you can get a one per cent sample of the census and you just go and download the file directly. It's open as what's called a public use microdata file.
Those are directly available. In New Zealand, there's a whole series of instructions you have to go through. You might be
subject to data use agreements. You might be subject to an application process, et cetera, et cetera. He's criticising this, saying it should be much easier - it should be the US model that's appropriate here rather than the New Zealand model. What we're really talking about here is that both are appropriate, depending upon the sorts of detail, the sorts of identifying information, that are available. Both might be valid models. They just allow you to do different things. The first model really focusses, in a sense, on masking the data to some degree, using some of the safe data approaches that George talked about. The other uses other aspects of the safes model to address confidentiality concerns. What you also find is researchers understand these, but there has to be some trade-off. The need for confidentiality is recognised and understood, and there may well be - there ought to be - trade-offs in return for that. For example, Card and his colleagues suggest that there is a set of criteria that you could put in place for enabling some form of access to sensitive microdata. They reference access through local statistical offices, through some remote connections such as the virtual enclave that George talked about, and monitoring of what people are doing. If you're going to have highly sensitive data available, the trade-off for access should be appropriate monitoring. So this is just one possible approach, but there's a recognition that access brings with it responsibilities and appropriate checks and balances. What I want to talk about is how that has eventuated in Australia - what do we see? [This bubble here]. The sorts of models that we see here in Australia - I've broken them out broadly.
I'd say four broad areas, but the one that people are probably most familiar with is the ABS, the Australian Bureau of
Statistics. They have a number of systems and access methods that suit different types of safe profiles. These include what's called confidentialised unit record files, or CURFs; the remote access data lab, which is one of their online execution systems; and an on-site data lab. You can go to the bowels of the ABS buildings - certainly in Canberra and, I believe, in other states as well - and do on-site processing. Then they have other systems. Probably the best known of these is what's called TableBuilder, which is an online data aggregation tool which does safe data processing on the fly. Our emphasis at ADA is primarily on these confidentialised unit record files, so we provide unit record access and some aggregated data access as well. Then we have the remote execution - or remote analysis - environments. I put under this model the Australian research infrastructure network for [geographic] data access in particular. The Secure Unified Research Environment (SURE) produced by the Population Health Research Network is an example of George's remote access environment as well, and even data linkage facilities - another part of the PHRN network - fit to some degree under this type of secure access model. That's, in a sense, a more extreme version of that. Then we have other ad hoc arrangements as well: things like physical secure rooms. A number of institutions have a secure space - there are a number here at ANU, for example. Then you might have other departmental arrangements as well that exist. We can probably classify those in terms of the distinction in the type of approaches that we have. What I've done here is just a very simple assessment, from not at all to a very strong yes, of how each fits within each safe element, from low to high.
I have some question marks on some of the facilities, particularly [sure, without] linkage facilities - not because I don't think they can do it; it's that I don't have enough information to make an assessment there.
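The kind of low-to-high profile Steve describes can be rendered as a small lookup. The scores below are my own illustrative placeholders, not the actual assessment from the slides:

```python
# Toy "five safes" profile: each access model scored on each safe from
# 0 (not at all) to 3 (very strong). Scores are invented for illustration.
profiles = {
    "ABS CURFs":       {"data": 3, "projects": 1, "settings": 1, "people": 1, "outputs": 1},
    "ABS on-site lab": {"data": 1, "projects": 2, "settings": 3, "people": 1, "outputs": 3},
    "Virtual enclave": {"data": 1, "projects": 2, "settings": 3, "people": 2, "outputs": 2},
}

# Which models trust the technology (safe settings) more than the
# people using it? This is the imbalance Steve goes on to discuss.
tech_over_people = [name for name, p in profiles.items()
                    if p["settings"] > p["people"]]
print(tech_over_people)  # ['ABS on-site lab', 'Virtual enclave']
```

The point of laying it out this way is that no model scores high on every safe; protection comes from the mix.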
If you look at the different types, things like the ABS models have tended towards safe data - those sorts of confidentialisation and [unclear] routines - output checking, and secure access models. Tabulations are a secure access model as well. They've tended less towards safe people and safe projects - less checking of people and checking of projects. In a lot of cases there's more trust in the technology than in the people using the technology, which I think is a little bit problematic, given that - and I'm going to talk to this - there are some fairly good processes in Australia for actually assessing the quality of people in particular and, to some extent, the projects. The point I'm making here is that you have different alternatives for how you might make sensitive data available. There's not one solution; it's a question of what mix of things I might do - and I'll come back to that at the end. In the Australian experience, as I say, we have a strong emphasis on safe data. We came up with a term in Australia - confidentialisation. That's probably the term you'll see most regularly; anywhere else in the world you would hear the term anonymisation. I'm not quite sure why this is the case but, as I say, in Australia the term we tend to use is confidentialisation. The Australian Data Archive uses this model, as do the ABS and the Department of Social Services - things like the Household, Income and Labour Dynamics in Australia (HILDA) survey use anonymisation techniques as the starting point. You can make data safe before you release it. It has its limitations. A good example: some of the data sets released into the environment used anonymisation alone - safe data was the priority.
The potential is for it to be reverse engineered - if you haven't done your anonymisation properly, it can be reversed, and then your safe data becomes a risk, so it has its flaws.
This is why we have tended towards looking at a combination of techniques. As George pointed out, if the risk of actually being identified is low - and particularly if the harm that comes from that is low - then it may be that this is sufficient. Certainly, for a lot of the content that we have at ADA, most of our emphasis is actually on safe data more than anything else.

Safe settings: we do have examples here. Tabulation systems - things where you can do cross-tabs online - are fundamentally a safe settings model. People don't get access to the unit record data; they just get access to the systems that produce outputs. Remote access systems - the data lab, the PHRN, the SURE system, and a new system that the ABS are bringing on, their remote data lab, which makes their data labs available in a virtual environment and is in a pilot stage that we're working with them on at the moment - are increasingly being used as well. There are also secure environments, as I mentioned - the data lab and the secure rooms.

Safe outputs: a number of the safe settings environments - because they tend to hold highly sensitive data - have safe output models as well. The real problem with these has been in scaling them. They require manual checking more often than not. Reviewing the output of these sorts of systems requires people, and that requires time; it's hard to automate as well. The ABS have invested a lot of money into automating output checking, in point of fact. Their TableBuilder system is one of the best around, but their new remote lab still has manual checking of outputs. It depends on what you're trying to do and the sorts of outputs you're producing as to whether you can actually automate the checking.
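The kind of automated output checking described here can be sketched with the threshold rule from statistical disclosure control: any frequency-table cell whose count falls below a minimum is suppressed before release. This is an illustrative sketch only - the threshold of 10, the function name and the toy table are assumptions, not the ABS's actual rules.

```python
# Sketch of a threshold-rule output check (a common statistical
# disclosure control technique): frequency-table cells below a
# minimum count are suppressed before release.

def check_table(cells, min_count=10):
    """Suppress small cells; return the releasable table and the flagged keys."""
    flagged = {key for key, count in cells.items() if count < min_count}
    released = {key: (None if key in flagged else count)
                for key, count in cells.items()}
    return released, flagged

table = {
    ("NSW", "employed"): 1520,
    ("NT", "employed"): 4,  # small cell: a re-identification risk
}
released, flagged = check_table(table)
# The NT cell is suppressed (None); the NSW cell passes through.
```

Real systems such as TableBuilder layer further protections (for example, perturbation of counts) on top of simple thresholds, which is partly why free-form outputs such as model estimates still tend to need manual review.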
The other side of this that I think will become increasingly relevant too is the replication and reproducibility elements of things that come out of systems like this. How are we going to facilitate replication models within those environments? I'm not sure that question has been addressed yet.

Safe researchers and safe projects in Australia - to be frank, they are considered in most models, but they're not really closely monitored, because they're difficult to monitor. How do you track the extent to which people follow the things that they've signed up to? Anyone who's been involved in reporting research outputs for ERA or anything similar will know that getting people to fill out forms recording what they have produced is hard; getting them to fill out forms saying whether they have complied with a data use agreement is [harder still]. That said, we do have some checks and balances. Certainly the ethics models and the codes of conduct for research provide some degree of vetting [assurance] for those that go through that sort of system. We have some checks and balances in place - particularly for university researchers - to address those sorts of concerns. I think an increasing emphasis on safe researchers and safe projects might be something we can leverage a bit more carefully. As I say, the frameworks we have in place - the Australian code of conduct, and increasingly professional association and journal requirements for data sharing - are going to put a degree of assessment on the sorts of practices we use as well. In America it's the Economic Association, and the DART agenda in political science: by assessing the sharing of data, these are also a mechanism for partly assessing the practices around it. That's something to be considering in the future. I'll quickly turn to the ADA model and then wrap up.
The ADA model - as I say, our emphasis is primarily on safe data. Data is anonymised, and it tends to be done in advance by the agencies and the
researchers that provide data to us. We will also do some review of content, and we'll provide recommendations back to our depositors: these are the sorts of things you'll probably want to think about - have you included things like postcodes or occupational information? If I know someone's postcode, their occupation and their age, there's a fair chance in many cases that I can identify them, in remote locations in Australia in particular. There are some basic checks you can do.

Certainly on safe people and safe settings: our data access is almost all mediated. You must be identified, and you must provide contact information and [supervise …]. We do some checking on safe people, and we ask for information in project descriptions - what do you intend to do with the data - particularly where we have more sensitive content. Often that's a requirement from depositors. Frankly, we don't operate in safe settings and safe outputs; that's not the space that we work in. We work with other agencies such as the ABS, and where you've got highly sensitive content that you want to make available, we'll point people to the relevant locations.

As I say, take something like the remote data lab - where is its focus? They focus less on safe data. They're a virtual enclave. They don't prohibit the use of safe data practices, but where you have highly sensitive data there's a more dedicated assessment process on a project and its outcomes. Highly safe settings, sitting at the ABS - the problem is the cost of establishing the system itself, and they vet all of the outputs, which has a cost associated with it. They have safe people - there is training for researchers prior to accessing the system - though there is some challenge in assessing the backgrounds of people, for example.
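The postcode/occupation/age point above is essentially a quasi-identifier uniqueness check, and one of the "basic checks" mentioned can be sketched as a k-anonymity count: any combination of quasi-identifier values shared by fewer than k records is an elevated re-identification risk. This is an illustrative sketch, not an ADA tool; the column names, the k of 5 and the toy records are assumptions.

```python
from collections import Counter

def k_anonymity_risk(records, quasi_identifiers, k=5):
    """Return combinations of quasi-identifier values held by fewer than
    k records -- those records are at elevated re-identification risk."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Toy data: one person is the only plumber of that age in a remote postcode.
records = [
    {"postcode": "0872", "occupation": "plumber", "age": 54},
] + [
    {"postcode": "3000", "occupation": "teacher", "age": 30},
] * 5

risky = k_anonymity_risk(records, ["postcode", "occupation", "age"], k=5)
# Only the unique remote-postcode record is flagged.
```

In practice, archives respond to flagged combinations by coarsening the variables (for example, postcode to state, exact age to age band) or by releasing the detailed variables only under stricter access conditions.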
How do you do that? This is where the need for domain experts comes in - if you're going to fully assess people and
projects and assess their domain expertise, you need domain experts to be able to [conduct] that sort of evaluation. The emphasis might well be on: are you using appropriate techniques, are you maintaining secure facilities, and what does the research [plan] itself look like - more so than the quality of the science, which is much harder to evaluate.

Safe projects: that has been used in some places at the ABS. Sometimes it's required for legislative reasons - the extent of data release is itself dependent upon meeting a public good statement, for example. One of the questions for the future for some organisations is: should this matter? Basic research itself might generate useful insights that you didn't expect. As I say, in some cases you're again probably going to be moving the levers, focusing on different aspects of the safe data environment.

I guess the message we want to put through here is that, certainly, there is a suite of options available to you for accessing sensitive data. Different models exist and they cover different ranges of the five safes. You can certainly incorporate safe people models. A lot of models focus on the expectation that we have an intruder - hackers coming in to access our system. Actually, what tends to be the case more often than not is the silly mistakes: I left my laptop on the train, or left my USB in the computer lab. That's far more common. We tend to default to certain options in terms of our mix of safes but, as I say, there are options available to you. What you have to think about is what's appropriate for the [form of] data that you're trying to work with. Fundamentally, the argument is that principles should enable the right mix of safes for a given data source.

Kate LeMay: Thank you very much, Steve.
It was a really great overview of the different ways that the five safes can be mixed and
used in different [settings]. I thought it was really interesting that both of you mentioned that a safe location was in a basement [laughs] - I've just got these images of people locked up in basements. I also wanted to note that George mentioned data masking and de-identification methods, and Steve mentioned confidentialisation and anonymisation; they're similar words for similar processes. ANDS has a de-identification guide available on our website now. If you're interested, it goes into more detail than the guide we have here as a handout that you can have a look at.

I was also wondering - George, you were talking about the data protection plan and the data use agreement, and that the onus is on the institution: if someone breaks the agreement, the institution needs to put them through some sort of research integrity investigation. If that doesn't happen, is there any potential recourse for the university? Would ICPSR turn around and say, you didn't follow this process, so you're not going to be accessing any of our data anymore?

George Alter: Sure. Actually, on our website we list the levels of escalation that we can go to. We can certainly cut off the institution from access to ICPSR data, but what really gets people's attention is that the National Institutes of Health in the US has an Office of Human Research Protections. If we thought that someone was breaching one of our agreements and endangering the confidentiality of research subjects, I would report them to that office. That office has a lot of power. They regularly publish the names of bad actors. What's more, they can cut off all NIH funding to universities. They have done that in the past when they thought that protections weren't in place. I always think of that as the nuclear option.
I know for a fact that university administrations and their trustees and [regents] are terrified that NIH will do something like that. Just waving that in front of a university compliance officer gets their attention.
Kate LeMay: Steve, I was wondering, with the Australian Data Archive's use agreement that people are signing - is that with the individual user or with the institution, as it is with… [Over speaking]

Steve McEachern: Primarily it's with the individual. We have a small number of organisational agreements, but not many. I would say the focus is more on an agreement between the individual and the organisation than at the organisational level. Some organisations do ask for them but, frankly, it's more for pragmatic reasons than compliance reasons: they want to host the content and manage access by requesting access to a particular data set for all members of their research team, for example. It just makes that easier, as it were. There are other models. As I say, in the ABS model the agreement is actually with the institution, and then individuals sign up to the institutional agreement. The Department of Social Services model is the same as well. It will be interesting to see the extent to which we move in one direction or another. I'd say the compliance argument hasn't been all that common here in Australia, except in the case where you have Government data. For academic-produced data it hasn't tended to be an emphasis.

Kate LeMay: With George's agreement with institutions, the recourse is that the institution should then have some integrity investigation. What level of recourse do you have with… [Over speaking]

Steve McEachern: [Limited].

Facilitator: …with the individual?

Steve McEachern: Limited. I mean, we would probably report back to the institution to which they belong. As I say, we do have the supervisory arrangements in question. We would probably also follow some of the questions
under the code of conduct for research. That's why I make reference back to the fact that there is an overarching set of obligations on those within Australian academic institutions; we would pursue something in that way. One of the challenges for us - and I'm going to guess for George as well - is just finding out where you get breaches of compliance. One of the hardest things to do is actually find out what happened in the first place. We've had one case that I'm aware of - certainly in my predecessors' lifetime - which goes back to the late '90s. It's not a common occurrence, but we're aware of it.

Kate LeMay: George mentioned standardised data use agreements between US institutions. Has that been formalised across a number of institutions as part of a consortium arrangement? Or is it more informal and gaining momentum?

George Alter: The example I gave is the [Databrary] project. They're the only ones I know that have done this in a formal way, where they get institutions to sign on as an institution and then that covers all of the researchers at that institution. It took them a while to negotiate that and get the bugs out, but I think it's paying off for them. This is something that I think other groups like ICPSR should move to. Right now it's a big problem: about one in six of our data use agreements at ICPSR involve a negotiation between lawyers at the University of Michigan and lawyers at the other institution. It's a major cost. I think it's one of the ways to go.

Steve McEachern: I would say in Australia we have a pretty strong example, which is the Universities Australia-ABS agreement. That model facilitates a whole lot of things. It has enabled access to the broad collection of ABS CURF data under a single agreement. The other side is that universities sign up for the cost that comes with that as well.
They're paying a fee for that but, as I say, it covers the broad spectrum of what they can do. The challenge in some cases is what [unclear] have you got for dissemination of the content?
As I say, if I went to the [next department] - I've had this discussion with various departments - could we establish a consistent data access agreement? The departments themselves are set up under different models in legislation, and the impact of that is that they can't necessarily have the same set of conditions. Certainly there is some capacity to [unclear] some of that and, I venture, we'll see the extent to which the [Productivity] Commission report that's [coming out on] data access might address some of those questions as well.

Kate LeMay: Just quickly, there's a question about whether there are any checklists or guidelines for new researchers to assess their research surveys for the level of confidentiality. I think they're talking about privacy risk assessments.

Steve McEachern: Actually, we have an internal checklist. This is something we've talked about in terms of what you need to do, but it really depends on publication. We talked before about the fact that in order to do certain research you actually need to have some things that [might be identifying], so it depends on which point in the data life cycle you're talking about. When we're thinking about data release, then, as I say, we would basically apply some basic principles - these are the sorts of things that we look for. We have talked about making that checklist available, in terms of these being the sorts of things you have to be concerned about. There is advice we could probably bring together but, as I say, it's the usability versus confidentiality question again. One of the things we sometimes do is split off those things that have a high confidentiality risk.
We actually release [several] different sets of data, so that if you need that additional information it can be made available under a separate, additional set of requirements, possibly in a different technological setting. I think it depends a little bit on when in the life cycle you're talking about. It often is useful to have information -
particularly, for example, if you're running a longitudinal study you must have identifying information going forward - you've got to be able to contact someone the next time round. It depends on what you're trying to achieve but, yes, there is some basic advice that we put out.

George Alter: There's a literature that's been used by statistical agencies about [unclear], but that whole area is somewhat contentious right now, because the statistical agencies developed that literature largely in the age when data were released in the form of published tables. When the data are available online and you can do repeated, iterative operations on them, you're in a new world. There's a separate literature that's developed in the computer science world. Anyway, it is a problem. There is guidance out there in really complex areas, like some health care areas. Doing a full assessment of a data set can be very complicated and difficult, so my recommendation is that people start with the basics and think about: how would you identify this person, and if this information got out, what harm would it cause? Often the researchers themselves have a good sense of that from the research they're doing.

Kate LeMay: There's one last question: are the five safes applicable in all research disciplines? Or are they specifically limited to suit the social sciences?

Steve McEachern: I think they're [broadly] applicable.

Kate LeMay: I agree.

Steve McEachern: I mean, it's interesting. We're having a discussion here about the social sciences but, for example, we work a lot with the health sciences [unclear] environmental sciences [unclear]. I don't see any reason why they shouldn't be applied elsewhere. Part of the question actually is more about what you have to think about in terms of the privacy and confidentiality risks, far more so than what the topic is.
The topic helps you make some sort of judgement about the harm, in George's terms, but yeah, it's the confidentiality questions that…
[Over speaking] Kate LeMay: The framework is [unclear]. Steve McEachern: Yeah. George Alter: Oh yeah. Kate LeMay: Fabulous. Thank you very much to George and Steve for coming along to our webinar today, and thank you everyone for calling in… END OF TRANSCRIPT