Sharing Confidential Data in ICPSR

Prof George Alter (UMich, ICPSR) presenting at the "Managing and publishing sensitive data in the Social Sciences" webinar on 29/3/17.

FULL webinar recording: https://youtu.be/7wxfeHNfKiQ

Webinar description:
Prof George Alter (Research Professor, ICPSR, and Visiting Professor, ANU) will share the benefit of ICPSR's more than 50 years of experience in managing sensitive social science data: https://www.icpsr.umich.edu/icpsrweb/

More about ICPSR:
• ICPSR (USA) maintains a data archive of more than 250,000 files of research in the social and behavioral sciences. It hosts 21 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.
• ICPSR collaborates with a number of funders, including U.S. statistical agencies and foundations, to create thematic collections: see https://www.icpsr.umich.edu/icpsrweb/content/about/thematic-collections.html


1. Sharing Confidential Data
George Alter, University of Michigan
2. Disclosure: Risk & Harm
• What do we promise when we conduct research about people?
– That benefits (usually to society) outweigh risk of harm (usually to individual)
– That we will protect confidentiality
• Why is confidentiality so important?
– Because people may reveal information to us that could cause them harm.
– Examples: criminal activity, antisocial activity, medical conditions...
3. Who are We Afraid of?
• Parents trying to find out if their child had an abortion or uses drugs
• Spouse seeking hidden income or infidelity in a divorce
• Insurance companies seeking to eliminate risky individuals
• Other criminals and nuisances
• NSA, CIA, FBI, KGB, SABOT, SBL, SMERSH, KAOS, etc...
4. What are We Afraid of...
• Direct Identifiers
– Inadvertent release of unnecessary information (Name, phone number, SSN…)
– Direct identifiers required for analysis (location, genetic characteristics,…)
• Indirect Identifiers
– Characteristics that identify a subject when combined (sex, race, age, education, occupation)
5. Deductive Disclosure
• A combination of characteristics could allow an intruder to re-identify an individual in a survey “deductively,” even if direct identifiers are removed.
• Dependent on
– Knowing someone in the survey
– Matching cases to a database
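This matching risk can be gauged by counting how many records share each combination of indirect identifiers; records that are unique on those characteristics are the ones an intruder could single out. A minimal sketch of such a check (the column names and data are hypothetical, and pandas is assumed):

```python
import pandas as pd

def equivalence_class_sizes(df: pd.DataFrame, quasi_ids: list[str]) -> pd.Series:
    """For each record, count how many records share its combination
    of indirect identifiers; a count of 1 marks a sample unique."""
    return df.groupby(quasi_ids)[quasi_ids[0]].transform("size")

# Hypothetical survey extract containing only indirect identifiers.
survey = pd.DataFrame({
    "sex":       ["F", "F", "M", "M", "F"],
    "age_group": ["30-39", "30-39", "70-79", "30-39", "30-39"],
    "education": ["BA", "BA", "PhD", "HS", "BA"],
})
survey["k"] = equivalence_class_sizes(survey, ["sex", "age_group", "education"])
print(survey[survey["k"] == 1])  # records re-identifiable by matching
```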
6. Deductive Disclosure
Contextual data increases the risk of disclosure:
– Some attributes can be known by an outsider (age, race)
– Individuals are more identifiable in smaller populations
• The more specific the geography, the more attention must be paid to disclosure risk.
7. Contextual data in social science research
Geographic context
• Neighborhood characteristics, economic conditions, health services, distance to resources, etc.
Institutional context
• School
• Hospital
• Prison
8. Current Survey Designs Increase the Risks of Disclosing Subjects’ Identities
• Geographically referenced data
• Longitudinal data
• Multi-level data:
– Student, teacher, school, school district
– Patient, clinic, community
9. Protecting Confidential Data
• Safe data: Modify the data to reduce the risk of re-identification
• Safe projects: Reviewing research designs
• Safe settings: Physical isolation and secure technologies
• Safe people: Training and data use agreements
• Safe outputs: Results are reviewed before being released to researchers
10. Safe data
Disclosure risks can be reduced by:
• Multiple sites rather than single locations
• Keeping sampling locations secret
– Releasing characteristics of contexts without providing locations
• Oversampling rare characteristics
11. Safe Data
Data masking
• Grouping values
• Top-coding
• Aggregating geographic areas
• Swapping values
• Suppressing unique cases
• Sampling within a larger data collection
• Adding “noise”
• Replacing real data with synthetic data
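Two of these operations are easy to show concretely. A minimal sketch, with hypothetical variables and thresholds (pandas assumed): top-coding caps extreme values, and grouping collapses exact ages into bands.

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 35, 41, 67, 95],
    "income": [28_000, 54_000, 61_000, 39_000, 740_000],
})

# Top-coding: values above the threshold are replaced by the threshold,
# so extreme (and identifiable) amounts never appear in the released file.
df["income_masked"] = df["income"].clip(upper=150_000)

# Grouping: release 10-year age bands instead of exact ages.
df["age_band"] = pd.cut(df["age"], bins=range(0, 101, 10), right=False)

print(df[["age_band", "income_masked"]])
```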
12. Safe Projects
• Research plans are reviewed before access is approved
• Levels of project review
1. Does the research plan require confidential data?
2. Would the research plan identify individual subjects?
3. Is the research scientifically sound? Does it “serve the public good”?
• Scientific review requires standards and expertise
13. Safe Settings
• Data protection plans
• Remote submission and execution
• Virtual data enclave
• Physical enclave
14. Safe Settings
Data Protection Plans should address risks:
• unauthorized use of account on computer
• computer break-in by exploiting vulnerability
• hijacking of computer by malware or botware
• interception of network traffic between computers
• loss of computer or media
• theft of computer or media
• eavesdropping of electronic output on computer screen
• unauthorized viewing of paper output
We often focus too much on technology and not enough on risk.
15. Improving Data Security Plans
• Problems
– PIs lack technical expertise
– Requirements are inconsistent and confusing
– Monitoring compliance is expensive
• An alternative: Institution-level data security protocols
– Tiered guidelines for different levels of risk
– Focus on mitigating risks, not on specifying technologies
– Certification of researchers
– Institutional oversight
16. Safe Settings
• Remote submission and execution
– User submits program code or scripts, which are executed in a controlled environment
• Virtual data enclave
– Remote desktop technology prevents moving data to the user’s local computer
– Requires a data use agreement
• Physical enclave
– Users must travel to the data
17. Virtual Data Enclave
18. The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.
19. Safe people
• Data use agreements
• Training
20. Safe people
• Parts of a data use agreement at ICPSR
– Research plan
– IRB approval
– Data protection plan
– Behavior rules
– Security pledge
– Institutional signature
21. Dataflow
[Diagram: data flow from Interview (with Informed Consent) to Data producer to Data archive (under a Data Dissemination Agreement) to Researcher, whose Institution signs a Data Use Agreement covering the Research Plan, IRB Approval, and Data Protection Plan.]
22. Data Use Agreement: Behavior rules
To avoid inadvertent disclosure of persons, families, households, neighborhoods, schools, or health services, use the following guidelines in the release of statistics derived from the dataset:
1. In no table should all cases in any row or column be found in a single cell.
2. In no case should the total for a row or column of a cross-tabulation be fewer than ten.
3. In no case should a quantity figure be based on fewer than ten cases.
4. In no case should a quantity figure be published if one case contributes more than 60 percent of the amount.
5. In no case should data on an identifiable case, or any of the kinds of data listed in preceding items 1-3, be derivable through subtraction or other calculation from the combination of tables released.
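Rules 1-3 lend themselves to an automated screen before output release. A minimal sketch against a pandas cross-tabulation of counts (the function name and example data are illustrative; rules 4-5 need the underlying magnitudes and the full set of released tables, so they are omitted):

```python
import pandas as pd

def behavior_rule_violations(table: pd.DataFrame, min_total: int = 10) -> list[str]:
    """Screen a cross-tabulation of counts against rules 1-3 above."""
    problems = []
    # Rule 1: no row or column may have all of its cases in a single cell.
    for axis, name in ((1, "row"), (0, "column")):
        totals = table.sum(axis=axis)
        if ((table.max(axis=axis) == totals) & (totals > 0)).any():
            problems.append(f"all cases of a {name} fall in one cell")
    # Rules 2-3: every row and column total must be at least min_total.
    for axis, name in ((1, "row"), (0, "column")):
        if (table.sum(axis=axis) < min_total).any():
            problems.append(f"a {name} total is below {min_total}")
    return problems

# Illustrative table with one concentrated column and small marginals.
crosstab = pd.crosstab(pd.Series(["a"] * 12 + ["b"] * 3, name="group"),
                       pd.Series(["x"] * 11 + ["y"] * 4, name="outcome"))
print(behavior_rule_violations(crosstab))
```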
23. Data Use Agreement
The Recipient Institution will treat allegations, by NAHDAP/ICPSR or other parties, of violations of this agreement as allegations of violations of its policies and procedures on scientific integrity and misconduct. If the allegations are confirmed, the Recipient Institution will treat the violations as it would violations of the explicit terms of its policies on scientific integrity and misconduct.
24. Problems with DUAs
• DUAs are issued by project.
– Every PI gets a new DUA, even if the Institution has already signed the DUA for someone else
• Language and conditions in DUAs are not standard
– Frequent negotiations and lawyering
25. Reducing the costs of DUAs
• Institution-wide agreements
– One agreement per institution, not per project
– A designated “data steward” adds qualified researchers to the agreement
– Example: Databrary Agreement
• Covers informed consent, data sharing, data use
• Researcher certification covering multiple datasets
26. Safe People: Disclosure risk online tutorial
Disclosure: graph with extreme values example
Data were collected for a sample of 104 people in a county. Among the variables collected were age, gender, and whether the person was arrested within the last year. Box plots show the distribution of age, one plot for those arrested (“yes”) and one for those who were not (“no”); the number labels are case numbers in the dataset.
[Summary statistics: N = 104; min age = 12; max age = 95; mean age = 51; std dev = 15; % female = 5.2; % arrested = 5.8]
The potential identifiability represented by outlying values is compounded here by an unusual combination that could probably be identified using public records for a county in the U.S.: someone approximately 90 years old was arrested in the sample. Including extreme values is a disclosure risk for identifiability when combined with other variables in the dataset.
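A screen for this pattern can be sketched minimally (column names and data are hypothetical, pandas assumed): flag values that are box-plot outliers within their subgroup, since those are the cases, like the roughly 90-year-old arrestee here, that need review before release.

```python
import pandas as pd

def subgroup_outliers(df: pd.DataFrame, group_col: str, value_col: str) -> pd.Series:
    """Flag records whose value is a box-plot outlier (beyond 1.5 * IQR
    from the quartiles) within their own subgroup."""
    grouped = df.groupby(group_col)[value_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df[value_col] < q1 - 1.5 * iqr) | (df[value_col] > q3 + 1.5 * iqr)

sample = pd.DataFrame({
    "arrested": ["no"] * 8 + ["yes"] * 6,
    "age": [25, 31, 38, 44, 50, 57, 63, 70,   # not arrested
            30, 34, 38, 41, 45, 90],          # arrested: 90 is the risky case
})
print(sample[subgroup_outliers(sample, "arrested", "age")])
```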
27. Safe outputs
• Controlled environments allow review of outputs
o Remote submission and execution
o Virtual data enclaves
o Physical enclaves
• Disclosure checks may be automated, but manual review is usually necessary
28. Weighing Costs and Benefits
• Data protection has costs
– Modifying data affects analysis
– Access restrictions impose burdens on researchers
• Protection measures should be proportional to risks
– Probability that an individual can be (re-)identified
– Severity of harm resulting from re-identification
29. Gradient of Risk & Restriction
[Chart: severity of harm plotted against probability of disclosure, with access restrictions scaled to risk]
• Tiny risk (simple data: minimal harm & very low chance of disclosure): web access
• Some risk (complex data: low harm & low probability of disclosure): data use agreement
• Moderate risk (complex data: moderate harm & re-identifiable with difficulty): strong DUA & technology rules
• High risk (high severity of harm & highly identifiable): enclosed data center
30. Thank you
George Alter
University of Michigan
altergc@umich.edu
31. Secure Multi-Party Computing
What if databases could send data to a trusted third party, who would compute statistics?
[Diagram: Database 1 and Database 2 send encrypted data to a third party for computation.]
MPC does this without the third party.
32. Average Income
Three people with true salaries S1, S2, S3, which they never reveal. Each computes random numbers Rij (sent from i to j) and reports their salary plus the random numbers they sent minus those they received, i.e.,
X1 = S1 + (R12 + R13) – (R21 + R31)
X2 = S2 + (R21 + R23) – (R12 + R32)
X3 = S3 + (R31 + R32) – (R13 + R23)
Summing: X1 + X2 + X3 = S1 + S2 + S3, so the average can be computed without revealing any individual salary.
Example from Daniel Goroff, Alfred P. Sloan Foundation
Homomorphic Encryption
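A minimal sketch of this additive-masking protocol in Python, generalized to n parties (pad ranges and names are illustrative assumptions; a real deployment would use cryptographically secure randomness and secure channels):

```python
import random

def secure_average(salaries: list[float]) -> float:
    """Additive masking: party i sends a random pad R[i][j] to each other
    party j, then publishes X_i = S_i + (pads sent) - (pads received).
    Every pad appears once with + and once with -, so the pads cancel
    in the sum and only the total (hence the average) is revealed."""
    n = len(salaries)
    # R[i][j] is the pad generated by party i and sent to party j.
    R = [[random.uniform(-1e6, 1e6) for _ in range(n)] for _ in range(n)]
    X = [
        salaries[i]
        + sum(R[i][j] for j in range(n) if j != i)   # pads party i sent
        - sum(R[j][i] for j in range(n) if j != i)   # pads party i received
        for i in range(n)
    ]
    return sum(X) / n  # equals the true mean; no X_i exposes any S_i

print(round(secure_average([50_000, 72_000, 91_000]), 2))  # 71000.0
```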
