Micah Altman NISO privacy in library systems

Micah Altman, NISO patron privacy initiative, Meeting #1, May 7, 2015: http://www.niso.org/topics/tl/patron_privacy/

Published in: Education

  1. NISO Lightning Overview: Identification & “Anonymization”. Micah Altman, Director of Research, MIT Libraries. Prepared for NISO Workshop on Patron Privacy Online, May 2015.
  2. DISCLAIMER: These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators. Secondary disclaimer: “It’s tough to make predictions, especially about the future!” -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc. Lightning Overview: Identification & “Anonymization”
  3. Collaborators & Co-Conspirators
     • Privacy Tools for Sharing Research Data Team (Salil Vadhan, P.I.): http://privacytools.seas.harvard.edu/people
     • Research support: supported in part by NSF grant CNS-1237235
  4. Related Work
     Main project:
     • Privacy Tools for Sharing Research Data, http://privacytools.seas.harvard.edu/
     Related publications:
     • Novak, K., Altman, M., Broch, E., Carroll, J. M., Clemins, P. J., Fournier, D., Laevart, C., et al. (2011). Communicating Science and Engineering Data in the Information Age. Computer Science and Telecommunications. National Academies Press.
     • Vadhan, S., et al. 2011. “Re: Advance Notice of Proposed Rulemaking: Human Subjects Research Protections.”
     • Altman, M., D. O’Brien, S. Vadhan, A. Wood. 2014. “Big Data Study: Request for Information.”
     • O'Brien, et al. 2015. “When Is Information Purely Public?” (Mar. 27, 2015). Berkman Center Research Publication No. 2015-7.
     • Wood, et al. 2014. “Long-Term Longitudinal Studies” (July 22, 2014). Berkman Center Research Publication No. 2014-12.
     Slides and reprints available from: informatics.mit.edu
  5. Identifiable private information is common
     • Birth date + ZIP code + gender uniquely identify ~87% of people in the U.S.
     • Social security numbers can be predicted from birth date and birthplace
     • Tables, graphs, and maps can reveal identifiable information
     • People have been identified through movie rankings, search strings, writing style…
     Brownstein et al., 2006, NEJM 355(16)
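
The uniqueness claim above can be checked on any table by counting how many records have a one-of-a-kind combination of quasi-identifiers. The following is a minimal sketch; the records and the `unique_fraction` helper are hypothetical and only illustrate the counting, not the ~87% estimate itself.

```python
from collections import Counter

# Hypothetical toy records: (birthdate, zipcode, gender). Not real data.
records = [
    ("1961-01-01", "02145", "M"),
    ("1961-02-02", "02138", "M"),
    ("1972-03-25", "94041", "F"),
    ("1972-03-25", "94041", "F"),
    ("1989-08-08", "02138", "F"),
]


def unique_fraction(rows):
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


fraction = unique_fraction(records)
print(f"{fraction:.0%} of records are unique on birthdate + zipcode + gender")
```

Any record counted as unique here could be singled out by someone who knows those three attributes from another source.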
  6. Privacy is not Confidentiality… (defining basic terms)
     • Privacy: control over the extent and circumstances of sharing
     • Confidentiality: control over the disclosure of information
     • Sensitive information: information that would cause harm if improperly disclosed (to an individual, institution, social group, or society)
     • Private personally identifiable information: not already purely public; directly or indirectly linkable to an identifiable individual, possibly using externally available information
  7. Legal Constraints are Complicated
     Diagram labels: Contract; Intellectual Property; Access Rights; Confidentiality; Copyright; Fair Use; DMCA; Database Rights; Moral Rights; Intellectual Attribution; Trade Secret; Patent; Trademark; Common Rule 45 CFR 46; HIPAA; FERPA; EU Privacy Directive; Privacy Torts (Invasion, Defamation); Rights of Publicity; Sensitive but Unclassified; Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …); Classified; FOIA; CIPSEA; State Privacy Laws; EAR; State FOI Laws; Journal Replication Requirements; Funder Open Access; Contract License; Click-Wrap TOU; ITA Export Restrictions
  8. Laws define “anonymized” differently
     • FERPA. Identification criteria: direct; indirect; linked; bad intent. Sensitivity criteria: any non-directory information.
     • HIPAA. Identification criteria: direct/indirect: 18 identifiers; OR statistician verifies minimal risk; AND no actual knowledge of identified individual. Sensitivity criteria: any medical information.
     • Common Rule. Identification criteria: direct; indirect/linked, if “readily identifiable”. Sensitivity criteria: private information, based on harm.
     • MA 201 CMR 17. Identification criteria: first initial + last name. Sensitivity criteria: financial, state, federal identifiers.
  9. Different definitions of identifiability
     Record-linkage
     • “where’s waldo”
     • Match a real person to a precise record in a database
     • Examples: direct identifiers
     • Caveats: satisfies compliance for specific laws, but not generally; substantial potential for harm remains
     Indistinguishability + Heterogeneity
     • “hiding in the crowd”
     • People can be matched only to a cluster of records
     • Based on quasi-ids
     • Sensitive attributes must also vary
     • Examples: k-anonymity, l-diversity, attribute disclosure
     • Caveats: potential for substantial harms may remain
     Learning
     • “privacy, guaranteed”
     • Formally bound the total learning about any individual that can occur from a query
     • Examples: differential privacy, zero-knowledge proofs
     • Caveats: challenging to implement, requires interactive system
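
To make the “learning” definition more concrete, here is a minimal sketch of one common differential-privacy building block: a counting query answered with Laplace noise. The data values, the function names, and the choice of epsilon are assumptions for illustration; this is not a hardened implementation and it does not track a privacy budget across queries.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(values, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity of the query is 1.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: number of crimes per record (toy values).
crimes = [0, 0, 0, 0, 0, 1, 1, 2, 4, 16]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
answer = noisy_count(crimes, lambda c: c > 0, epsilon=0.5, rng=rng)
print(f"noisy count of records with at least one crime: {answer:.2f}")
```

Smaller values of epsilon add more noise, which is one face of the trade-off between the formal guarantee and the usefulness of the answer.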
  10. How many things are wrong with this picture?
      Name | SSN | Birthdate | Zipcode | Gender | Favorite Ice Cream | # of crimes committed
      A. Jones | 12341 | 01011961 | 02145 | M | Raspberry | 0
      B. Jones | 12342 | 02021961 | 02138 | M | Pistachio | 0
      C. Jones | 12343 | 11111972 | 94043 | M | Chocolate | 0
      D. Jones | 12344 | 12121972 | 94043 | M | Hazelnut | 0
      E. Jones | 12345 | 03251972 | 94041 | F | Lemon | 0
      F. Jones | 12346 | 03251972 | 02127 | F | Lemon | 1
      G. Jones | 12347 | 08081989 | 02138 | F | Peach | 1
      H. Smith | 12348 | 01011973 | 63200 | F | Lime | 2
      I. Smith | 12349 | 02021973 | 63300 | M | Mango | 4
      J. Smith | 12350 | 02021973 | 63400 | M | Coconut | 16
      K. Smith | 12351 | 03031974 | 64500 | M | Frog | 32
      L. Smith | 12352 | 04041974 | 64600 | M | Vanilla | 64
      M. Smith | 12353 | 04041974 | 64700 | F | Pumpkin | 128
      N. Smith | 12354 | 04041974 | 64800 | F | Allergic | 256
  11. What’s wrong with this picture? (The same table as slide 10, with annotations.)
      Column annotations, in slide order: Identifier Sensitive Private Identifier Private Identifier Identifier Sensitive
      Row annotations: Unexpected Response? Mass resident. FERPA too? Californian. Twins, separated at birth?
  12. Common Approach: Suppress Information for Data Release
      Suppressed records:
      * Jones * * 1961 021*
      * Jones * * 1961 021*
      * Jones * * 1972 9404*
      * Jones * * 1972 9404*
      * Jones * * 1972 9404*
      Published outputs:
      • Modal practice: “The correlation between X and Y was large and statistically significant”
      • Summary statistics
      • Contingency table
      • Public use sample microdata
      • Information visualization
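
One way to reason about a suppressed release like the five rows on this slide is to measure how large the “crowd” is for each record. The sketch below computes k-anonymity and the simplest (distinct-values) form of l-diversity. The quasi-identifier tuples follow the slide, the sensitive values are the crime counts of the first five rows of the deck's example table, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# (quasi-identifiers as released, sensitive value = # of crimes committed)
released = [
    (("Jones", "1961", "021*"), 0),
    (("Jones", "1961", "021*"), 0),
    (("Jones", "1972", "9404*"), 0),
    (("Jones", "1972", "9404*"), 0),
    (("Jones", "1972", "9404*"), 0),
]


def k_anonymity(rows):
    """Size of the smallest group of rows sharing the same quasi-identifiers."""
    return min(Counter(qi for qi, _ in rows).values())


def l_diversity(rows):
    """Smallest number of distinct sensitive values found within any group."""
    groups = defaultdict(set)
    for qi, sensitive in rows:
        groups[qi].add(sensitive)
    return min(len(values) for values in groups.values())


k = k_anonymity(released)
l = l_diversity(released)
print(f"k = {k}, l = {l}")
```

Here every record hides among at least two, yet each group holds a single sensitive value, so knowing which group a person falls in already reveals that value.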
  13. Help, help, I’m being suppressed…
      Name | SSN | Birthdate | Zipcode | Gender | Favorite Ice Cream | # of crimes committed
      [Name 1] | 12341 | *1961 | 021* | M | Raspberry | .1
      [Name 2] | 12342 | *1961 | 021* | M | Pistachio | -.1
      [Name 3] | 12343 | *1972 | 940* | M | Chocolate | 0
      [Name 4] | 12344 | *1972 | 940* | M | Hazelnut | 0
      [Name 5] | 12345 | *1972 | 940* | F | Lemon | .6
      [Name 6] | 12346 | *1972 | 021* | F | Lemon | .6
      [Name 7] | 12347 | *1989 | 021* | * | Peach | 64.6
      [Name 8] | 12348 | *1973 | 632* | F | Lime | 3
      [Name 9] | 12349 | *1973 | 633* | M | Mango | 3
      Callouts on the table: Row; Var; Synthetic; Global Recode; Local Suppression; Aggregation + Perturbation
      Traditional Static Suppression
      • Data reduction: observation, measure, cell
      • Perturbation: microaggregation, rule-based data swapping, adding noise
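
Two of the techniques named on this slide can be sketched in a few lines: a global recode that coarsens birthdate and zipcode the way the table does, and a basic microaggregation that replaces each value with the mean of a small group. Both functions, the field names, and the grouping rule are illustrative assumptions, not the procedure used to produce the slide's numbers.

```python
def global_recode(record):
    """Coarsen quasi-identifiers: keep only the birth year and a 3-digit ZIP prefix."""
    return {
        "birth_year": "*" + record["birthdate"][-4:],  # MMDDYYYY -> *YYYY
        "zip3": record["zipcode"][:3] + "*",
        "gender": record["gender"],
    }


def microaggregate(values, k):
    """Replace each value with the mean of its group of k neighbours in sorted order.

    Simplification: the last group may hold fewer than k values.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for start in range(0, len(order), k):
        group = order[start:start + k]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


row = {"name": "A. Jones", "birthdate": "01011961", "zipcode": "02145", "gender": "M"}
recoded = global_recode(row)
smoothed = microaggregate([0, 1, 1, 2, 4, 16], k=2)
print(recoded)
print(smoothed)
```

Microaggregation keeps the overall total of the column while blurring individual values, which is why it is often paired with recoding of the identifying columns.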
  14. Suppression reduces utility
      • Common approach of anonymizing/suppressing data reduces usefulness
      • Minimizing disclosure in the presence of large external data sources reduces usefulness a lot
      • Anonymized data is not simply less informative -- it typically yields biased analyses
  15. New Data – New Challenges
      • How to deidentify without completely destroying the data?
      • The “Netflix problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]
      • The “GIS problem”: fine geo-spatial-temporal data is impossible to mask when correlated with external data [Zimmerman 2008]
      • The “Facebook problem”: it is possible to identify masked network data if only a few nodes are controlled [Backstrom et al. 2007]
      • The “Blog problem”: pseudonymous communication can be linked through textual analysis [Novak et al. 2004]
      [For more examples see Vadhan, et al 2010]
      Image source: [Calabrese 2008; Real Time Rome Project 2007]
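
The linkage risk behind these problems can be illustrated with a deliberately simplified case: an exact-match join between a “de-identified” release and an external list that carries names. The attacks cited on the slide use probabilistic matching on sparse data, which is more powerful than this sketch; all records, field names, and the `link` helper here are hypothetical.

```python
# Hypothetical "de-identified" release: names removed, quasi-identifiers kept.
released = [
    {"birthdate": "08081989", "zipcode": "02138", "gender": "F", "crimes": 1},
    {"birthdate": "03251972", "zipcode": "94041", "gender": "F", "crimes": 0},
]

# Hypothetical external list (for example a public roster) that includes names.
external = [
    {"name": "G. Jones", "birthdate": "08081989", "zipcode": "02138", "gender": "F"},
    {"name": "X. Doe", "birthdate": "01011990", "zipcode": "02139", "gender": "M"},
]

KEYS = ("birthdate", "zipcode", "gender")


def link(released_rows, external_rows, keys=KEYS):
    """Return (name, released_row) pairs where exactly one external record matches."""
    index = {}
    for person in external_rows:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for row in released_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the row
            matches.append((names[0], row))
    return matches


reidentified = link(released, external)
for name, row in reidentified:
    print(name, "->", row["crimes"], "crime(s)")
```

Removing names did nothing for the first record, because the remaining attributes match exactly one person in the external list.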
  16. Little Data – Big World
      • The “Favorite Ice Cream” problem -- public information that is not risky can help us learn information that is risky
      • The “Doesn’t Stay in Vegas” problem -- information shared locally can be found anywhere
      • The “Data Exhaust” problem -- wherever you go, there you are, and your data too!
  17. Algorithmic Discrimination
      • Emergent behavior of algorithms, big data, and behavior leads to discrimination on private personal characteristics
  18. Information Science Approach: Manage Privacy & Confidentiality Lifecycle
      • Collection: consent/licensing terms; methods; measures
      • Storage: systems information security; data structures and partitioning
      • Dissemination: vetting; disclosure limitation; data use agreements
      Diagram, lifecycle stages: Creation/Collection; Storage/Ingest; Processing; Internal Sharing; Analysis; External dissemination/publication; Re-use; Long-term access
      Diagram, other labels: Research methods; Data Management Systems; Legal / Policy Frameworks; Statistical / Computational Frameworks
  19. Hybrid Approaches
      • Collection limitations
        • Limitations on collection
        • Inform and consent
      • Data enclaves: physically restrict access to data
        • Examples: ICPSR, Census Research Data Center
        • May include availability of synthetic data as an aid to preparing model specifications
        • Advantages: extensive human auditing and vetting; information security threats much reduced
        • Disadvantages: expensive, slow, inconvenient to access
      • Controlled remote access
        • Varies from remote access to all data and output to human vetting of output
        • Restrictions on use, easier to enforce
        • Advantages: auditable, potential to impose human review, potential to limit analysis
        • Disadvantages: complex to implement, slow
      • Model servers
        • Mediated remote access: analysis limited to designated models
        • Advantages: faster, no human in loop
        • Disadvantages: statistical methods for ensuring model safety are immature (residuals, categorical variables, and dummy variables are all risky); very limited set of models currently supported; complex to implement
      • Experimental approaches
        • Personal Data Stores
        • Data Auditing and Accountability
  20. Questions? Web: informatics.mit.edu
  21. Creative Commons License. This work, Managing Confidential Information in Research, by Micah Altman (http://redistricting.info), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.