1. OPEN DATA AND PRIVACY IN THE HUMANITIES
(AND ARTS AND SOCIAL SCIENCES)
Helen Cullyer, Program Officer
Scholarly Communications
The Andrew W. Mellon Foundation
2. OPEN DATA
Available for universal reuse and redistribution (though
some open licenses do prevent commercial use)
Promotes transparency and reproducibility; advances research;
and makes the results of scholarly inquiry available to the
public and policymakers as well as to other scholars
3. DATA IN THE HUMANITIES, ARTS, AND
HUMANISTIC SOCIAL SCIENCES
Digitized and born-digital primary source collections (text,
image, multimedia) and their associated metadata
Transcriptions and annotations
Survey or other data collected by researchers
Data that results from computational analysis of digital
collections and raw datasets
4. OPEN KNOWLEDGE FOUNDATION
ON OPEN DATA AND PRIVACY:
Our Data is data with no personal element, and a clear sense of shared ownership.
Some examples would be where the buses run in my city, what the government decides
to spend my tax money on, how the national census is structured and the aggregate
data resulting from it. At the Open Knowledge Foundation, our default position is
that our data should be open data – it is a shared asset we can and should all
benefit from.
My Data is information about me personally, where I am identified in some way, regardless of who
collects it. It should not be made open or public by others without my direct permission – but it
should be “open” to me (I should have access to data about me in a useable form, and the right
to share it myself, however I wish if I choose to do so).
Transformed Data is information about individuals, where some effort has been made to
anonymise or aggregate the data to remove individually identified elements.
http://personal-data.okfn.org/2013/12/13/open-data-privacy/
5. IS THERE REALLY A PROBLEM?
Can’t you just anonymize and aggregate data
and make those data openly accessible?
• Anonymized data are not necessarily de-identified data: records stripped of
names can often be re-identified by linking them to other datasets (see the
sketch below)
• Aggregated data are not always granular enough in the humanities and
humanistic social sciences
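The re-identification risk is easy to see in miniature. Below is a minimal Python sketch of a linkage attack: "anonymized" records are matched against a public directory on shared quasi-identifiers, and a unique match restores the name. All data, field names, and the choice of quasi-identifiers here are hypothetical illustrations, not drawn from any project discussed in these slides.

```python
# Hypothetical "anonymized" survey records: names removed, but
# quasi-identifiers (ZIP code, birth year, gender) retained.
anonymized_survey = [
    {"zip": "10027", "birth_year": 1952, "gender": "F", "response": "..."},
    {"zip": "78712", "birth_year": 1985, "gender": "M", "response": "..."},
]

# A hypothetical public directory carrying the same quasi-identifiers.
public_directory = [
    {"name": "Jane Doe", "zip": "10027", "birth_year": 1952, "gender": "F"},
    {"name": "John Roe", "zip": "78712", "birth_year": 1985, "gender": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

def link(record, directory):
    """Return directory entries whose quasi-identifiers all match the record."""
    return [entry for entry in directory
            if all(entry[k] == record[k] for k in QUASI_IDENTIFIERS)]

for record in anonymized_survey:
    matches = link(record, public_directory)
    if len(matches) == 1:  # a unique match re-identifies the respondent
        print(matches[0]["name"], "->", record["response"])
```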
6. SOME QUESTIONS
• What types of transformations can be used to de-identify data while
retaining their granularity and usefulness, so that they can be made open?
(One such transformation, generalization, is sketched after this list.)
• Does the push for open data distract us from the need to craft
careful and multi-level access policies for certain data types?
• Is there a danger that, in trying to make general rules and policies, we will
ignore the differences among particular cases and either (a) develop
requirements that are too lax and endanger privacy, or (b) place unnecessary
restrictions on data that could and should be open?
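On the first question: one family of transformations is generalization, the operation behind k-anonymity, which coarsens quasi-identifiers until every released combination of values covers at least k individuals. The Python sketch below uses hypothetical records and an arbitrary k = 2; a real project would have to balance how much is coarsened against the granularity scholars actually need.

```python
from collections import Counter

K = 2  # every released combination of quasi-identifiers must cover >= K records

records = [  # hypothetical records
    {"zip": "10027", "birth_year": 1952, "gender": "F"},
    {"zip": "10025", "birth_year": 1958, "gender": "F"},
    {"zip": "78712", "birth_year": 1985, "gender": "M"},
    {"zip": "78705", "birth_year": 1987, "gender": "M"},
]

def generalize(record):
    """Coarsen quasi-identifiers: ZIP to a 3-digit prefix, birth year to a decade."""
    return {"zip": record["zip"][:3] + "**",
            "birth_decade": record["birth_year"] // 10 * 10,
            "gender": record["gender"]}

generalized = [generalize(r) for r in records]

# Verify k-anonymity: count how many records fall into each equivalence class.
classes = Counter(tuple(sorted(g.items())) for g in generalized)
assert all(count >= K for count in classes.values()), "release is not k-anonymous"
print(generalized)
```

The tension the slides name is visible even here: the coarser the generalization, the safer the release, but the less granular and useful the data become.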
7. EXAMPLE #1:
RECORDS OF THE CENTRAL LUNATIC ASYLUM FOR THE COLORED INSANE
(King Davis, University of Texas at Austin)
• Digitized organizational and medical records dating from 1868 through
1967
• Of interest to scholars in a number of fields (history, history of science and
medicine, African American Studies) and to the families of former patients
• Privacy challenges: HIPAA regulations; state law; IRB regulations; and a
host of ethical concerns
• What sort of access to different types of data will be given to different
groups? How will access mechanisms for digital data be implemented?
8. EXAMPLE #2:
SUBSCRIBERS TO THE NEW YORK PHILHARMONIC
(Shamus Khan, Columbia; and Barbara Haws, NY Philharmonic)
• Digitized and born-digital subscriber records (1842 to the present) that
contain names and addresses
• Columbia researchers are transcribing records and augmenting them with
other publicly available data (e.g., census data, information from the New
York Social Register)
• All names post-1953 are redacted in the Columbia data
• What to share openly, and how, of the post-1953 data? (One candidate
technique, pseudonymization, is sketched below.)
• NY Phil working on privacy and access policies for post-1953 archival
records that they hold
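The slides leave open how the post-1953 data might eventually be shared. One technique sometimes used for records like these, offered here only as a hypothetical sketch and not as the project's actual policy, is pseudonymization: each name is replaced by a stable token derived from a keyed hash, so a subscriber can be traced across seasons without the name itself appearing in the released data.

```python
import hashlib
import hmac

# The key would be generated once and held by the archive, never released
# with the data; this placeholder value is for illustration only.
SECRET_KEY = b"replace-with-a-key-kept-offline"

def pseudonymize(name: str) -> str:
    """Map a name to a stable, non-reversible token via a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, name.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]

subscriber_records = [  # hypothetical post-1953 records
    {"name": "A. Subscriber", "season": 1954, "seats": "Box 12"},
    {"name": "A. Subscriber", "season": 1960, "seats": "Box 12"},
]

released = [{"subscriber_id": pseudonymize(r["name"]),
             "season": r["season"], "seats": r["seats"]}
            for r in subscriber_records]
print(released)  # the repeated subscriber_id links both seasons; the name never appears
```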
9. EXAMPLE #3:
EXCAVATING EPORTFOLIOS:
DIGGING INTO A DECADE OF STUDENT-DRIVEN DATA
Amanda Licastro, CUNY Graduate Center, @amandalicastro
http://digitocentrism.commons.gc.cuny.edu/
• Data include a large sample of student writing (from publicly available
WordPress eportfolio sites), anonymous survey responses, and interviews
• All private sites and private posts are stripped out of the eportfolios; no
grading information or other official student records are included in the data
• Results of computational analysis will be published
• Raw data must be encrypted according to IRB requirements (a minimal
illustration follows)
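For concreteness, the snippet below shows what encryption at rest can look like, using the third-party Python cryptography package. The IRB protocol, not this sketch, defines what is actually required, and the sample plaintext is invented.

```python
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()  # generate once; store apart from the encrypted data
fernet = Fernet(key)

raw = b"participant_id,response\n001,..."  # stand-in for a raw survey export
ciphertext = fernet.encrypt(raw)           # this, not `raw`, is what gets stored

# Only a holder of the key can recover the plaintext:
assert Fernet(key).decrypt(ciphertext) == raw
```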
10. SOME NEEDS:
• Technical help and mentorship regarding data management
• “…examples of data management plans and workflow
sequences for large data projects that could serve as instructive
models for humanists like myself. And this work would be done
best in a collaborative maker space where scholars from across
the disciplines could have designated sessions where we could
trouble-shoot together.”
Amanda Licastro
11. ARE IRB REQUIREMENTS TOO STRICT IN MANY CASES?
See summary of recent National Academies report:
“To first determine if research activities fall within the scope of the Common Rule, the report
recommends that HHS define “human subjects research” as a systematic investigation designed
to develop or contribute to generalizable knowledge that involves direct interaction or
intervention with a living individual or that involves obtaining identifiable private information
about an individual. Only research that fits this definition should be subject to IRB procedures
and the Common Rule.
Building on this definition, HHS should also clarify that research which relies on publicly available
information, information in the public domain, or information that can be observed in public
contexts does not meet the definition of human subjects research -- regardless of whether the
information is personally identifiable -- as long as individuals whose information is used have no
reasonable expectation of privacy. This includes digital data, some types of administrative
records, and public-use data files...”
http://www8.nationalacademies.org/onpinews/newsitem.aspx?RecordID=18614
12. THE COMPLICATED HUMAN SUBJECTS PICTURE
• Does the research involve personally identifiable data? Is the
source data publicly available anyway?
• Would the research involve presentation of data (digital or
otherwise) to human subjects with the intention of deceiving or
manipulating those subjects?
• What is the potential for harm in the research itself and
dissemination of data?
For an excellent account of the current contested landscape, see Christopher Shea,
“New Rules for Human-Subject Research are Delayed and Debated”,
http://chronicle.com/article/New-Rules-for-Human-Subject/149767/?cid=wc&utm_source=wc&utm_medium=en
13. PRELIMINARY CONCLUSIONS
• The dichotomy between open and non-open is, in many cases, a false one.
There are plenty of data types and versions of data that cannot be made
fully open yet can be shared with a limited group of individuals in carefully
controlled ways
• We need to be worried about privacy regulations and policies that are
too stringent as well as those that are too lax
• What is a “reasonable expectation of privacy” in a networked
environment?
• We need to generate robust regulations and policies, at a high level of
generality, that both protect privacy and allow for collaborative and
thoughtful discussion about what is appropriate in particular cases
14. FINAL THOUGHTS
Utilitarian approach: Quantify the risk of harm
Deontological (Kantian) approach: “Always do X”, “Never do Y”
Aristotelian particularist approach: The standard of judgment is the reasonable
person
But how do we generate the “robust regulations and policies, at a high
level of generality…” within which reasonable persons act? At what
level of generality should those laws and policies function?
A rich typology of research projects that involve personal, identifiable
data is needed