One of the success stories of open research data is the research on malaria:

“Open Source Malaria is discovering and developing new medicines for the treatment of malaria, a terrible disease that kills about 2000 people each day. The project's innovation is to use open source principles in drug discovery: anyone can take part at any level of the project, all data and ideas are freely shared, and there will be no patents. The project therefore operates without secrecy. About 100 people have so far contributed from all over the world - you can too.

Traditionally, research into new medicines is secretive. In part this is because the pharmaceutical industry needs patents in order to be financially sustainable. But the secrecy is a weakness because it can lead to expensive duplication of effort and different groups of scientists not learning from each other. Open Source Malaria removes the secrecy so that the research can be carried out more efficiently and the public and patient groups can become involved in the work. In much the same way as open source software has made such great products we all use (like the Firefox web browser and the Android operating system), we think that open source malaria can produce high quality, low-cost medicines that could help a lot of people.

To date, four sets of molecules have been pursued in the OSM consortium. The starting points came from big pharmaceutical companies like GlaxoSmithKline and Pfizer. The OSM community then worked to improve these compounds by making them more active against the malaria parasite and by making them longer-lasting in the body. The current series, Series 4, is able to cure mice of malaria. This means that we're close to taking a compound into clinical trials where it could be used by people. If that is successful we would have achieved something really new - we would have taken a public domain compound all the way to market where it could be made at low cost by generics manufacturers.

If we can do this then we can start asking whether we could envisage a whole new pharmaceutical industry based on open source principles. Could we use open source pharma to discover new medicines for heart disease, or Alzheimer's? How about cancer, or medicines to fight superbugs? We won't know until we try. It would be fantastic if you could help do that.”

From: http://thinkable.org/submission/2055 https://www.youtube.com/watch?t=105&v=gCOokjOiVTc
Why make research data open?

There seems to be a consensus, especially amongst policy makers, that open access to research data is a good thing: a desirable goal that the research community should pursue. Ideas about open access to research data tend to build on a set of norms deriving from the Mertonian norms:
• Research data is a public or common good
• Verify and reproduce
• Minimize costs
• Collaboration

Open access would result in a number of benefits: it reinforces scientific inquiry, enables diversity of analysis and opinions, increases public understanding of and trust in science, stimulates business activity, helps to solve global challenges, etc.
A pair of computer scientists famously demonstrated the limits of anonymisation by combining movie ratings found on the Internet Movie Database with the Netflix data, showing that individuals could quite easily be identified in the Netflix data.

“On October 2, 2006, Netflix, the “world's largest online movie rental service,” publicly released one hundred million records revealing how nearly a half-million of its users had rated movies from December 1999 to December 2005. In each record, Netflix disclosed the movie rated, the rating assigned (from one to five stars), and the date of the rating. Netflix first anonymized the records, removing identifying information like usernames, but assigning a unique user identifier to preserve rating-to-rating continuity. Thus, researchers could tell that user 1337 had rated Gattaca a 4 on March 3, 2003, and Minority Report a 5 on November 10, 2003.

Netflix had a specific profit motive for releasing these records. Netflix thrives by being able to make accurate movie recommendations; if Netflix knows, for example, that people who liked Gattaca will also like The Lives of Others, it can make recommendations that keep its customers coming back to the website. To improve its recommendations, Netflix released the hundred million records to launch what it called the “Netflix Prize,” a prize that took almost three years to claim. The first team that used the data to significantly improve on Netflix's recommendation algorithm would win one million dollars. As with the AOL release, researchers have hailed the Netflix Prize data release as a great boon for research, and many have used the competition to refine or develop important statistical theories.
Two weeks after the data release, researchers from the University of Texas, Arvind Narayanan and Professor Vitaly Shmatikov, announced that “an attacker who knows only a little bit about an individual subscriber can easily identify this subscriber's record if it is present in the [Netflix Prize] dataset, or, at the very least, identify a small set of records which include the subscriber's record.” In other words, it is surprisingly easy to reidentify people in the database and thus discover all of the movies they have rated with only a little outside knowledge about their movie-watching preferences. The resulting research paper is brimming with startling examples of the ease with which someone could reidentify people in the database, and has been celebrated and cited as surprising and novel to computer scientists. If an adversary—the term used by computer scientists—knows the precise ratings a person in the database has assigned to six obscure movies, and nothing else, he will be able to identify that person 84 percent of the time. If he knows approximately when (give or take two weeks) a person in the database has rated six movies, whether or not they are obscure, he can identify the person 99 percent of the time. In fact, knowing when ratings were assigned turns out to be so powerful that knowing only two movies a rating user has viewed (with the precise ratings and the rating dates give or take three days), an adversary can reidentify 68 percent of the users.

From: Ohm, Paul, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization” (August 13, 2009), UCLA Law Review, Vol. 57, p. 1701, 2010; U of Colorado Law Legal Studies Research Paper No. 9-12.
Available at SSRN: http://ssrn.com/abstract=1450006

“First, we can immediately find his political orientation based on his strong opinions about “Power and Terror: Noam Chomsky in Our Times” and “Fahrenheit 9/11.” Strong guesses about his religious views can be made based on his ratings on “Jesus of Nazareth” and “The Gospel of John”. He did not like “Super Size Me” at all; perhaps this implies something about his physical size? Both items that we found with predominantly gay themes, “Bent” and “Queer as Folk”, were rated one star out of five. He is a cultish follower of “Mystery Science Theater 3000”. This is far from all we found about this one person, but having made our point, we will spare the reader further lurid details.”

From: http://arxivblog.com/?p=142

Another example: William Weld

“The Massachusetts Group Insurance Commission had a bright idea back in the mid-1990s—it decided to release "anonymized" data on state employees that showed every single hospital visit. The goal was to help researchers, and the state spent time removing all obvious identifiers such as name, address, and Social Security number. But a graduate student in computer science saw a chance to make a point about the limits of anonymization. Latanya Sweeney requested a copy of the data and went to work on her "reidentification" quest. It didn't prove difficult. Law professor Paul Ohm describes Sweeney's work: At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor's hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes.
For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office”. From: http://arstechnica.com/tech-policy/2009/09/08/your-secrets-live-online-in-databases-of-ruin/
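The kind of linkage attack Sweeney performed can be illustrated with a few lines of code. The sketch below is hypothetical: all records and field names are invented for illustration, and real attacks must cope with messy, inconsistent data. It simply joins an "anonymised" dataset and a public dataset on the shared quasi-identifiers (birth date, sex, ZIP code) and reports any unique matches.

```python
# Hypothetical sketch of a quasi-identifier linkage attack, in the spirit of
# Sweeney's re-identification of Governor Weld. All records are invented.

# "Anonymised" health records: direct identifiers removed, but quasi-identifiers
# (birth date, sex, ZIP code) retained.
health_records = [
    {"birth_date": "1945-07-31", "sex": "M", "zip": "02138", "diagnosis": "..."},
    {"birth_date": "1962-03-14", "sex": "F", "zip": "02139", "diagnosis": "..."},
]

# Public voter roll: names attached to the same quasi-identifiers.
voter_roll = [
    {"name": "W. Weld", "birth_date": "1945-07-31", "sex": "M", "zip": "02138"},
    {"name": "J. Smith", "birth_date": "1962-03-14", "sex": "F", "zip": "02139"},
]

def link(health_records, voter_roll):
    """Join the two datasets on the quasi-identifier triple and return
    records that match exactly one named individual."""
    by_quasi_id = {}
    for v in voter_roll:
        key = (v["birth_date"], v["sex"], v["zip"])
        by_quasi_id.setdefault(key, []).append(v["name"])
    matches = []
    for h in health_records:
        key = (h["birth_date"], h["sex"], h["zip"])
        names = by_quasi_id.get(key, [])
        if len(names) == 1:  # a unique match means the record is re-identified
            matches.append((names[0], h))
    return matches

for name, record in link(health_records, voter_roll):
    print(name, "->", record["diagnosis"])
```

The point of the sketch is that no single field identifies anyone, yet the combination of a few innocuous fields often does, exactly as in the Weld case.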
Providing open access to research data makes it increasingly difficult to maintain the confidentiality and privacy of research subjects. The standard practice in human subject research, as prescribed by current ethical codes and international laws, is to anonymise and de-identify the data that research participants provide and to properly inform them about how the data will be used and by whom. However, anonymisation of data does not suffice to mitigate the risk for all data sets. As a result of technological advances and the availability of increasingly more digital data sets, anonymisation can be more easily undone, for instance by combining and integrating different data sets. Furthermore, in some cases, measures or strategies to preserve confidentiality can be reverse-engineered. In some research projects, anonymisation is not even possible because the data content enables identification and resists effective obfuscation. For instance, data from ethnographic studies of particular communities (e.g., transcribed interviews or field notes) may contain descriptions of practices and people that could be easily used to identify specific individuals. A concern is that the removal of identifying characteristics of research subjects may compromise the meaning, integrity and quality of the data. Even if effective anonymisation is technically feasible, research participants may still feel uncomfortable:
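One common mitigation, generalising quasi-identifiers so that each combination covers several people (the idea behind k-anonymity), can be sketched as follows. The records, field names and generalisation choices are invented assumptions for illustration; they are not a prescription for any particular dataset.

```python
# Minimal sketch of generalising quasi-identifiers before data release
# (the idea behind k-anonymity). All records and fields are invented.

def generalise(record):
    """Coarsen quasi-identifiers: keep only the birth year and a
    3-digit ZIP prefix, dropping the exact birth date and full ZIP."""
    return {
        "birth_year": record["birth_date"][:4],  # "1945-07-31" -> "1945"
        "zip_prefix": record["zip"][:3],         # "02138" -> "021"
        "sex": record["sex"],
        "diagnosis": record["diagnosis"],
    }

def k_anonymity(records):
    """Smallest group size sharing one quasi-identifier combination.
    A release is k-anonymous if this value is at least k."""
    groups = {}
    for r in records:
        key = (r["birth_year"], r["zip_prefix"], r["sex"])
        groups[key] = groups.get(key, 0) + 1
    return min(groups.values())

records = [
    {"birth_date": "1945-07-31", "sex": "M", "zip": "02138", "diagnosis": "A"},
    {"birth_date": "1945-01-02", "sex": "M", "zip": "02139", "diagnosis": "B"},
    {"birth_date": "1962-03-14", "sex": "F", "zip": "02139", "diagnosis": "C"},
    {"birth_date": "1962-11-08", "sex": "F", "zip": "02141", "diagnosis": "D"},
]
released = [generalise(r) for r in records]
print(k_anonymity(released))  # each quasi-identifier combination covers 2 records
```

Note the trade-off the surrounding text describes: the larger the groups, the harder re-identification becomes, but the coarsened fields also carry less analytic value, and generalisation alone does not protect against attacks that exploit the sensitive values themselves.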
“The data that I collect is interviews with people. I have not engaged with open context because I deal with people who are doing illegal acts! Under my ethics review, I provide anonymity. To anonymise the data and to put it on an open context, adds another level. I doubt that many of my participants would agree if one of the stipulations was that I put my data into an open context”. (Ethical editorial reviewer, Archaeology)
A response to the privacy risks can be to explain to research participants the extent to which subsequent use can be effectively anticipated and to ensure, so far as possible, that the principles that apply to the governance of the data are consistent with the prevailing privacy expectations. False assurance, and its associated drawbacks, may also be avoided by not “overpromising”, i.e., by being transparent and realistic about the possibilities of re-identification. Asking consent can be problematic, because it is difficult to predict how the data may be used. Moreover, it may affect recruitment strategies and have additional effects on how researchers design and conduct their investigations. It might deter individuals, but it may also provide new opportunities to empower research participants. For instance, identifiable data provides participants with more possibilities of keeping track of their data, and may enable important additional outcomes. For example, some health research initiatives may find additional markers, symptoms or characteristics that place particular research participants in high-risk categories for specific diseases. In these cases, ethical research practice encourages that the individual in question be contacted with this information, which would not be possible unless identifiable information was stored and shared. There are also ethical concerns in re-using data. You should always check what the limitations are on the use of data, but in some cases it is difficult to know what the expectations are, for instance in the case of publicly available information on social media sites. Someone posting on a website may not appreciate his or her comments being used for scientific research.
Fullerton and Lee have identified some ethically questionable secondary uses of data from the Human Genome Diversity Panel (HGDP). The HGDP is a collection that contains human tissue samples from 51 different human populations that were originally donated by multiple independent researchers over a period of years. The samples are archived together with the geographic location and the sex of the individual from whom the sample was taken. Fullerton and Lee reviewed the secondary uses of this collection and found that whereas the majority of studies were in line with the original intent of the collection, some published studies “could be regarded as controversial, objectionable or potentially stigmatizing in their interpretation”. One publication that they reviewed used samples from the HGDP to support the findings of a study that examined genetic signatures of Jewish ancestry in European Americans, concluding that Jewish people are genetically distinct. Fullerton and Lee argue that such studies may cause indirect harm to participants, as they may support potentially unfavourable conclusions about the populations from which participants were drawn, which may lead to discrimination or stigmatisation within populations or communities.

Fullerton, Stephanie M, and Sandra S-J Lee, “Secondary uses and the governance of de-identified data: Lessons from the Human Genome Diversity Panel”, BMC Medical Ethics, Vol. 12, No. 16, 2011. http://www.biomedcentral.com/1472-6939/12/16 Ibid., p. 3.
In some instances the intended secondary use or misappropriation of research data may cause unacceptable damage or distress to individuals and groups, as well as to research and the scientific enterprise. It can harm or wrong research participants or other stakeholders, particularly when results are perceived to be manipulated or distorted or when data are used for purposes that research participants themselves find objectionable. An example is the secondary use of culturally sensitive samples and data, such as human remains. In particular, misinterpretation or misappropriation can offend communities and individuals. Unintended secondary use can damage identities, reputations and relationships between individuals, and may even endanger research subjects or sites.

“Well certainly there are a myriad of First Nations people who may feel offended or compromised if the raw materials related to religious locations, remains etc., are made publicly available and consumable in the wrong fashion. […] if you are putting native artefacts on display online, information about them online, it really comes down to a whole hodgepodge of historic questions regarding how each particular tribal entity has been treated politically and also what their particular cultural feelings are about such matters. For groups that have less sensitivity about the remains of the deceased, you have to remember that these really represent scores of different cultural sensibilities.” (Editorial reviewer, Archaeology)

One concern is that the misinterpretation of publicly available medical health data by patients, for instance, can put these patients at risk. It often requires considerable knowledge and expertise to evaluate and interpret research data properly and to use it to decide on medical diagnosis and treatment.
Unintended use can be particularly problematic when it involves personal data about research participants' ethnic or racial origins, political opinions, sexuality, religious beliefs, criminal background, or physical or mental health. It may result in stigmatisation, discrimination or other kinds of harm. In addition, research participants may feel wronged or betrayed when their expectations about the use of their information do not match with the intentions and practices of new studies.

Dual use: Some data can be used for research that could produce knowledge, products or technologies that benefit society, but could also pose a threat to public health, agriculture, plants, animals, the environment or material. Such dual-use data present an ethical dilemma for data sharing and open access: do the benefits of providing access to research data outweigh the costs? Sharing data on a virus, for example, may facilitate research on an antidote, but people with ill intent may also use it to disrupt societies.
One of the better-known examples of the dual use dilemma was the publication of two manuscripts that reported on the details of laboratory experiments with the H5N1 avian flu virus. The manuscripts concluded that the virus had a greater potential to be transmitted between mammals, including humans, than previously thought. After various reviews, the journals Nature and Science decided to publish the articles, because they believed that the benefits of publishing outweighed the risks. After publication, scientists agreed to a one-year moratorium “to provide time to explain the public-health benefits of this work, to describe the measures in place to minimise possible risks, and to enable organizations and governments around the world to review their policies (for example, on biosafety, biosecurity, oversight and communication) regarding these experiments”.

Committee on Research Standards and Practices to Prevent the Destructive Application of Biotechnology, Biotechnology Research in an Age of Terrorism, National Research Council, The National Academies Press, Washington, DC, 2004. Royal Society, op. cit., 2012, p. 58. Fouchier, Ron A. M., Adolfo García-Sastre, Yoshihiro Kawaoka, and 37 co-authors, “H5N1 Virus: Transmission Studies Resume for Avian Flu”, Nature, Vol. 493, No. 609, 31 January 2013. DOI:10.1038/nature11858
In 1996, representatives from sequencing centres around the world met in Bermuda to draft a set of principles for free and rapid access to Human Genome Project data. These became known as the “Bermuda Principles”.
Licensing provides a useful way to address intellectual property issues as well as ethical issues such as commercialisation, misappropriation and misuse of scientific information. These licensing models include Creative Commons licenses (the most commonly employed form of licensing) and other licenses such as open government licenses. The creators of research data and/or the repositories in which the data are stored may use licenses to establish clear conditions on how the research data should be used, including, for example, attributing content to the original researchers and restrictions on modifying data. However, whilst licensing offers important protections for stakeholders, it also introduces pitfalls, such as the availability of licensed material under terms that do not fully comply with the European Commission's definition of open access. Irrespective of this, licensing continues to be a commonly employed practical solution in the move towards open access to research data.

Archaeology, physics and clinical data all require some form of professional accreditation or other access management review in order to enable researchers to access data. This professional gate-keeping solution allows these disciplines to manage legal and ethical compliance in relation to open access to research data. Specifically, it serves to identify true “professionals” who have expertise in research methods or legal requirements such as confidentiality, privacy, data protection and research ethics. This solution ensures that the data is used responsibly and that any potential issues associated with misuse are identified and mitigated. It also serves as a mechanism for enforcement, whereby individuals who do not use data responsibly may not be “approved” a second time.
In relation to Bioengineering, a large, multi-national data bank of biological material uses the following strategy:

“They would identify from our website which data sets they want, because the data sets are listed. […] They would write the access request email, which is on our website […] And they would say, ‘I'm interested in these data, can you pass me onto the relevant data owners.’ And then there is a form to fill out, which might not be the same for each dataset. And you basically have to explain who you are and why you would like to use the data and what you want to use it for, and that's just passed onto the data owners. And if the committee says yes, then we give them access. So the data are encrypted, so we would create a user password.” (Scientific services manager, Bioengineering)
These processes are particularly effective in meeting requirements around intellectual property rights, data protection, secondary or dual use of research material, and commercialisation. They therefore prevent unethical usage of the research data and aim to achieve and maintain legal compliance.

In addition, the use of editorial review mechanisms emerges as a useful tool in ensuring ethical data practice and legal compliance. Internal processes have been adopted amongst our case study participants as a solution to the publication of research data that may have resulted from unethical practices and/or in a manner that may be contrary to applicable laws. However, the editorial review solution may also introduce new pitfalls not so dissimilar to those associated with access management as described above. By way of specific example, Open Context adheres to an editorial review process that involves participation from local governments:
“So what we do is, before it even goes to Open Context, our data go through a cleaning process, where the sites are allocated to a grid in the grid system and then we scrub the coordinate data and any data that are considered sensitive by our state partners, which can potentially differ state to state, and then we put it up on Open Context. So the only location information relates to our grid.” (Editorial reviewer, Archaeology)

Finally, the use of existing ethical and legal guidance instruments, such as checklists or professional codes of conduct, is also employed by our case study participants as a solution to assist stakeholders in effectively evaluating their responsibilities. However, soft-law measures carry the potential for pitfalls to the extent that, although they encourage ethical practices and legal compliance, they do not mandate them.
An ethical editorial reviewer in the Archaeology case study explains the adoption of soft-law measures by their organisation:
“[W]e take a lot of our clues on the ethical front from various journals and other kinds of venues where people publish this kind of material routinely, and most journals and publishing houses have ethical guidelines that they follow. And we look to them sometimes for clues, because it's quite similar in many ways.”
Ethical and legal issues in making research data open
Ethical concerns – Part II
• Unintended consequences and misinterpretation
• Publicly available data: social media
• Dual use
Human Genome Project
1. Automatic release of sequence assemblies larger than 1 kb (preferably within 24 hours).
2. Immediate publication of finished annotated sequences.
3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.
• Intellectual Property Rights (IPR)
• Data protection
• Open Access legislation
Intellectual Property Rights (IPR)
• Trade Secret
• Database rights
• Different kinds of licenses
• Creative Commons
• Personal data
• Sensitive personal data
• Pseudonymised data
Eight principles of data protection
• Fair and lawful
How to meet the Data Protection Directive
• Personal genome project (US and UK)
Open Access legislation
• Public sector information
• National legislation